voxceleb_trainer's Introduction

VoxCeleb trainer

This repository contains the framework for training speaker recognition models described in the papers 'In defence of metric learning for speaker recognition' and 'Pushing the limits of raw waveform speaker recognition'.

Dependencies

pip install -r requirements.txt

Data preparation

The following script can be used to download and prepare the VoxCeleb dataset for training.

python ./dataprep.py --save_path data --download --user USERNAME --password PASSWORD 
python ./dataprep.py --save_path data --extract
python ./dataprep.py --save_path data --convert

In order to use data augmentation, also run:

python ./dataprep.py --save_path data --augment

In addition to the Python dependencies, wget and ffmpeg must be installed on the system.

Training examples

  • ResNetSE34L with AM-Softmax:
python ./trainSpeakerNet.py --config ./configs/ResNetSE34L_AM.yaml
  • RawNet3 with AAM-Softmax:
python ./trainSpeakerNet.py --config ./configs/RawNet3_AAM.yaml
  • ResNetSE34L with Angular prototypical:
python ./trainSpeakerNet.py --config ./configs/ResNetSE34L_AP.yaml

You can pass individual arguments that are defined in trainSpeakerNet.py by --{ARG_NAME} {VALUE}. Note that the configuration file overrides the arguments passed via command line.
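The override order described above can be illustrated with a minimal sketch. It assumes the configuration has already been loaded into a plain dictionary (the real script reads a YAML file); the `apply_config` helper and the example values are hypothetical and are not taken from trainSpeakerNet.py.

```python
import argparse


def apply_config(args, config):
    """Overwrite parsed arguments with values from a config mapping."""
    for key, value in config.items():
        if hasattr(args, key):
            # Config values win over whatever was given on the command line.
            setattr(args, key, value)
    return args


parser = argparse.ArgumentParser()
parser.add_argument("--batch_size", type=int, default=200)
parser.add_argument("--trainfunc", type=str, default="softmax")

# Simulated command line: the user asks for a batch size of 100.
args = parser.parse_args(["--batch_size", "100"])

# Stand-in for the parsed YAML file (hypothetical values).
config = {"batch_size": 400, "trainfunc": "angleproto"}
args = apply_config(args, config)

print(args.batch_size, args.trainfunc)  # 400 angleproto
```

In practice this means a value set in the configuration file has to be edited there (or removed from the file) for a command-line value to take effect.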

Pretrained models

A pretrained model, described in [1], can be downloaded from here.

You can check that the following script returns: EER 2.1792. You will be given an option to save the scores.

python ./trainSpeakerNet.py --eval --model ResNetSE34L --log_input True --trainfunc angleproto --save_path exps/test --eval_frames 400 --initial_model baseline_lite_ap.model

A larger model trained with online data augmentation, described in [2], can be downloaded from here.

The following script should return: EER 1.0180.

python ./trainSpeakerNet.py --eval --model ResNetSE34V2 --log_input True --encoder_type ASP --n_mels 64 --trainfunc softmaxproto --save_path exps/test --eval_frames 400  --initial_model baseline_v2_smproto.model

Pretrained RawNet3, described in [3], can be downloaded via git submodule update --init --recursive.

The following script should return EER 0.8932.

python ./trainSpeakerNet.py --eval --config ./configs/RawNet3_AAM.yaml --initial_model models/weights/RawNet3/model.pt

Implemented loss functions

Softmax (softmax)
AM-Softmax (amsoftmax)
AAM-Softmax (aamsoftmax)
GE2E (ge2e)
Prototypical (proto)
Triplet (triplet)
Angular Prototypical (angleproto)

Implemented models and encoders

ResNetSE34L (SAP, ASP)
ResNetSE34V2 (SAP, ASP)
VGGVox40 (SAP, TAP, MAX)

Data augmentation

--augment True enables online data augmentation, described in [2].

Adding new models and loss functions

You can add new models and loss functions to the models and loss directories, respectively. See the existing definitions for examples.

Accelerating training

  • Use --mixedprec flag to enable mixed precision training. This is recommended for Tesla V100, GeForce RTX 20 series or later models.

  • Use --distributed flag to enable distributed training.

    • GPU indices should be set before training using the command export CUDA_VISIBLE_DEVICES=0,1,2,3.

    • If you are running more than one distributed training session, you need to change the --port argument.

Data

The VoxCeleb datasets are used for these experiments.

The train list should contain the identity and the file path, one line per utterance, as follows:

id00000 id00000/youtube_key/12345.wav
id00012 id00012/21Uxsk56VDQ/00001.wav
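As a rough illustration of this format, the sketch below groups utterance paths by speaker identity. The `parse_train_list` helper is hypothetical and is not part of the repository; it only assumes that each line holds an identity and a path separated by whitespace.

```python
from collections import defaultdict


def parse_train_list(lines):
    """Group utterance paths by speaker identity."""
    utterances = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        speaker, path = line.split(maxsplit=1)
        utterances[speaker].append(path)
    return dict(utterances)


example = [
    "id00000 id00000/youtube_key/12345.wav",
    "id00012 id00012/21Uxsk56VDQ/00001.wav",
]
by_speaker = parse_train_list(example)
print(len(by_speaker))  # 2
```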

The train list for VoxCeleb2 can be downloaded from here. The test lists for VoxCeleb1 can be downloaded from here.

Replicating the results from the paper

  1. Model definitions
  • VGG-M-40 in [1] is VGGVox in the repository.
  • Thin ResNet-34 in [1] is ResNetSE34 in the repository.
  • Fast ResNet-34 in [1] is ResNetSE34L in the repository.
  • H / ASP in [2] is ResNetSE34V2 in the repository.

  2. For metric learning objectives, the batch size in the paper is nPerSpeaker multiplied by batch_size in the code. For the batch size of 800 in the paper, use --nPerSpeaker 2 --batch_size 400, --nPerSpeaker 3 --batch_size 266, etc.

  3. The models have been trained with --max_frames 200 and evaluated with --max_frames 400.

  4. You can get a good balance between speed and performance using the configuration below.

python ./trainSpeakerNet.py --model ResNetSE34L --trainfunc angleproto --batch_size 400 --nPerSpeaker 2 
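The batch-size arithmetic for metric learning objectives described above can be sketched as follows. The `code_batch_size` helper is hypothetical; it only restates that the paper's batch size equals nPerSpeaker multiplied by batch_size, rounding down when the division is not exact.

```python
def code_batch_size(paper_batch_size, n_per_speaker):
    """Return the --batch_size value for a given paper batch size."""
    # Integer division: the paper batch size may not divide evenly.
    return paper_batch_size // n_per_speaker


for m in (2, 3):
    bs = code_batch_size(800, m)
    print(f"--nPerSpeaker {m} --batch_size {bs}  (effective {m * bs})")
```

With three utterances per speaker the effective batch size is 798 rather than 800, which matches the README's suggestion of --batch_size 266.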

Citation

Please cite [1] if you make use of the code. Please see here for the full list of methods used in this trainer.

[1] In defence of metric learning for speaker recognition

@inproceedings{chung2020in,
  title={In defence of metric learning for speaker recognition},
  author={Chung, Joon Son and Huh, Jaesung and Mun, Seongkyu and Lee, Minjae and Heo, Hee Soo and Choe, Soyeon and Ham, Chiheon and Jung, Sunghwan and Lee, Bong-Jin and Han, Icksang},
  booktitle={Proc. Interspeech},
  year={2020}
}

[2] The ins and outs of speaker recognition: lessons from VoxSRC 2020

@inproceedings{kwon2021ins,
  title={The ins and outs of speaker recognition: lessons from {VoxSRC} 2020},
  author={Kwon, Yoohwan and Heo, Hee Soo and Lee, Bong-Jin and Chung, Joon Son},
  booktitle={Proc. ICASSP},
  year={2021}
}

[3] Pushing the limits of raw waveform speaker recognition

@inproceedings{jung2022pushing,
  title={Pushing the limits of raw waveform speaker recognition},
  author={Jung, Jee-weon and Kim, You Jin and Heo, Hee-Soo and Lee, Bong-Jin and Kwon, Youngki and Chung, Joon Son},
  booktitle={Proc. Interspeech},
  year={2022}
}

License

Copyright (c) 2020-present NAVER Corp.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

voxceleb_trainer's People

Contributors

dvisockas, joonson, jungjee, lbjcom, msh9184, trellixvulnteam


voxceleb_trainer's Issues

Question about trade-off between batch size and max_frames

Hi. Thanks for the good resource!!

In your paper, the performance of the metric learning-based model is said to be affected by the batch size. I have two questions.

  1. I'm wondering if max_frames and max_seg_per_spk also affect performance.
  2. I would like to increase the batch size, but there is not enough GPU memory.
    After seeing comments that multi GPU (Dataparallel) is not recommended, I am asking if reducing the max_frames to increase the batch size is the right way.

Is `test_list` flag intended to point to "the test list for VoxCeleb1"?

As README.md describes,

The train list for VoxCeleb2 can be download from [here](http://www.robots.ox.ac.uk/~vgg/data/voxceleb/meta/train_list.txt) and the
test list for VoxCeleb1 from [here](http://www.robots.ox.ac.uk/~vgg/data/voxceleb/meta/veri_test.txt).

the name of the test list file for VoxCeleb1 is veri_test.txt, not test_list.txt.
Is the test_list flag meant to point to that file?

Pretrained models for AM/AMS-Softmax

Hello. Thank you for sharing the code!
I have my own test set for speaker verification (there are 2 different test sets; I tried both, and the problem occurs in both). I am having trouble evaluating the models pre-trained with metric learning (AP) on my test sets: the EER is too high (22% and 30%). I used your pretrained Fast ResNet and Thin ResNet models.
However, when tested on the model from Utterance-level Aggregation For Speaker Recognition In The Wild (Thin Resnet34) with classification objective (Softmax), the EER for my data is significantly lower (10 and 8%).
The format I am passing is the same (label(0/1), wav1, wav2).
I was wondering if you could share the models pre-trained on classification objectives (AM/AMS-Softmax) publicly, so that I can evaluate my data on them.

Implementation

Hi, thank you very much for sharing. I want to ask whether I can provide an audio sample to generate a voice fingerprint, and then use that ID and a new audio sample to verify that it is indeed the same person speaking, in order to build a biometric voice authentication system.

my score file has ONLY NEGATIVE score....

Hello,
I think this GitHub page is very closely related to VoxSRC 2020.
The VoxSRC 2020 development toolkit requires a score file, and the development toolkit website says:
" score; must be a floating point number between the closed interval [0, 1], where 1 means the pair of segments correspond to the same speaker and 0 means the pair of segments correspond to different speakers."

But when I evaluate my model with this code, the written score file has ALL NEGATIVE scores, not values in [0, 1].
My score file starts with
-0.6494 id10270/x6uYqmx31kE/00001.wav id10270/8jEAjG6SegY/00008.wav
-1.3724 id10270/x6uYqmx31kE/00001.wav id10300/ize_eiCFEg0/00003.wav
-0.7400 id10270/x6uYqmx31kE/00001.wav id10270/GWXujl-xAVM/00017.wav
-1.2865 id10270/x6uYqmx31kE/00001.wav id10273/0OCW1HUxZyg/00001.wav
-0.6512 id10270/x6uYqmx31kE/00001.wav id10270/8jEAjG6SegY/00022.wav
-1.1112 id10270/x6uYqmx31kE/00001.wav id10284/Uzxv7Axh3Z8/00001.wav
-0.6810 id10270/x6uYqmx31kE/00001.wav id10270/GWXujl-xAVM/00033.wav
-1.0992 id10270/x6uYqmx31kE/00001.wav id10284/7yx9A0yzLYk/00029.wav
-0.7889 id10270/x6uYqmx31kE/00002.wav id10270/5r0dWxy17C8/00026.wav
-1.1319 id10270/x6uYqmx31kE/00002.wav id10285/m-uILToQ9ss/00009.wav

Is there a problem?
Is it OK to use this code to participate in VoxSRC 2020?
Thank you.
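One way to map such scores into [0, 1] is a min-max rescale, sketched below. This is only an assumption about what would satisfy the format requirement, not a method confirmed by the repository or by the challenge organisers; the `rescale_scores` helper is hypothetical. Because the mapping is monotonic, the ranking of trials is unchanged.

```python
def rescale_scores(scores):
    """Min-max rescale a list of scores into the closed interval [0, 1]."""
    lo, hi = min(scores), max(scores)
    span = hi - lo
    if span == 0:
        # All scores identical: no ranking information, return a constant.
        return [0.5 for _ in scores]
    return [(s - lo) / span for s in scores]


raw = [-0.6494, -1.3724, -0.7400, -1.2865]
scaled = rescale_scores(raw)
print(min(scaled), max(scaled))  # 0.0 1.0
```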

Pairwise for 10 features when evaluation

May I know the intuition behind using the pairwise distance over 10 features at evaluation instead of cosine similarity? I have tried cosine similarity, but the performance is worse.
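For readers unfamiliar with the scheme being asked about, here is a minimal sketch. It assumes each utterance is represented by several segment embeddings and that the score is the negative mean Euclidean distance over all segment pairs. The function name and the toy 2-D embeddings are hypothetical; this is not the repository's code.

```python
import math


def mean_pairwise_distance_score(ref_segs, com_segs):
    """Score = negative mean Euclidean distance over all segment pairs."""
    dists = [math.dist(r, c) for r in ref_segs for c in com_segs]
    return -sum(dists) / len(dists)


# Two segments per utterance instead of 10, to keep the example small.
ref = [(0.0, 0.0), (0.0, 0.0)]
com = [(3.0, 4.0), (3.0, 4.0)]
score = mean_pairwise_distance_score(ref, com)
print(score)  # -5.0
```

A higher (less negative) score means the two utterances are closer, so the score can still be thresholded like a similarity.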

Pretrained models for GE2E

Hi, thank you for sharing the code and pre-trained models :)

I'm working with multi-speaker speech synthesis and I would like to know if it is possible for you to make the ResNetSE34L model trained with GE2E available, using the same parameters as the model provided trained with Angular Prototypical (EER 2.2322)

Embeddings from GE2E-trained models are used to represent a speaker in multi-speaker synthesis systems, so I want to compare the performance of Angular Prototypical loss with GE2E in multi-speaker speech synthesis :).

Problems in converting voxceleb

Hello, I'm having problems using dataprep.py to convert the AAC files.
I am using this command:

D:\VGGvox\VGGvoxtrainer\voxceleb_trainer>python ./dataprep.py --save_path ./voxceleb --convert
and this error shows up:

Converting files from AAC to WAV
0%| | 0/1091606 [00:00<?, ?it/s]
The system cannot find the path specified.
0%| | 0/1091606 [00:00<?, ?it/s]
Traceback (most recent call last):
File "./dataprep.py", line 142, in <module>
convert(args)
File "./dataprep.py", line 113, in convert
raise ValueError('Conversion failed %s.'%fname)
ValueError: Conversion failed ./voxceleb/voxceleb2\id00012\21Uxsk56VDQ\00001.m4a.

I have been trying to figure this out for days.
Thank you very much.

A problem about prototypical loss.

Hello,
if M = 2, is prototypical loss then a form of one-shot learning?
Is the prototype of a class then just the embedding of a single utterance?
thank you very much!

Cannot Find Dev/Test Lists

Where can we find the dev/test lists for voxceleb 1 and 2?

Based on /home/joon/voxceleb/test_list.txt in the README, it seems like it should have been extracted with dataprep.py. But this wasn't the case after I went through the download/extract/convert steps.

Contrastive Confusion

    ## loss functions
    if self.loss_func == 'contrastive':
        nloss = torch.mean(torch.cat([torch.pow(pos_dist, 2), torch.pow(F.relu(self.margin - neg_dist), 2)], dim=0))
    elif self.loss_func == 'triplet':
        nloss = torch.mean(F.relu(torch.pow(pos_dist, 2) - torch.pow(neg_dist, 2) + self.margin))

    scores = -1 * torch.cat([pos_dist, neg_dist], dim=0).detach().cpu().numpy()

    errors = tuneThresholdfromScore(scores, labelnp, [])

    return nloss, errors[1]

Why are there three inputs for contrastive?

About nSpeakers setting

Hi, thanks for sharing!
I have a question.
When running the training example, should I set nSpeakers to 5994, since the number of speakers in the VoxCeleb2 dev set is 5994? I noticed that the default nSpeakers is 6200.

parser.add_argument('--nSpeakers', type=int, default=6200, help='Number of speakers in the softmax layer for softmax-based losses, utterances per speaker per iteration for other losses');

python ./trainSpeakerNet.py --model ResNetSE34 --encoder SAP --trainfunc amsoftmax --optimizer adam --save_path data/exp1 --batch_size 200 --max_frames 200 --scale 30 --margin 0.3 --train_list /home/joon/voxceleb/train_list.txt --test_list /home/joon/voxceleb/test_list.txt --train_path /home/joon/voxceleb/voxceleb2 --test_path /home/joon/voxceleb/voxceleb1

Question about contrastive loss input

    for findex, key in enumerate(dictkeys):
        data    = self.data_dict[key]
        numSeg  = round_down(min(len(data),self.max_seg_per_spk),self.gSize)

        rp      = lol(numpy.random.permutation(len(data))[:numSeg],self.gSize)

        flattened_label.extend([findex] * (len(rp)))
        for indices in rp:
            flattened_list.append([data[i] for i in indices])

It seems that all of the input pairs are from the same person, e.g. (p1,p1), (p2,p2), (p3,p3). Is there ever an input pair containing utterances from different people, such as (p1,p2)? Thanks!

Fast ResNet-34 specifications

In the paper, you mention that "Due to space constraints, the exact specification [for Fast ResNet-34] can be found in the accompanying code." I can't find this specification in the repo. Can you point me to the right place?

I need not only test accuracy but test loss also...

When I train with the default settings, VEER is calculated every 10 iterations, and I take VEER to mean Validation Equal Error Rate.
So VEER is derived from test accuracy and is a very important measure.
But I need the validation loss as well as VEER to pick the best model.
How can I print the validation loss?

Thank you...

nSpeaker affects M?

I'm a bit confused about how data in a minibatch is organized.

When I call the training script like this:

python trainSpeakerNet.py ... --trainfunc angleproto --batch_size 64 ... --nSpeakers 3 ...

and examine the input x of AngleProtoLoss.forward, its shape is [64,3,512]. If I change nSpeakers to 2, it's [64,2,512].

Isn't the second dimension number of utterances per speaker? I don't understand why it's equal to nSpeakers. Unless I'm mistaken about prototypical loss, this isn't correct because we need to compute the centroid by taking the mean across different utterances for the same speaker, but we're instead computing the mean across different speakers.
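To make the question concrete, here is a minimal sketch of how a prototype (centroid) would be formed if the second dimension indexes utterances of the same speaker. The README refers to this value as nPerSpeaker. Using the first utterance as the query is an arbitrary choice for illustration, and the `split_query_and_prototype` helper is hypothetical.

```python
def split_query_and_prototype(speaker_embeddings):
    """Split one speaker's M embeddings into a query and a centroid."""
    query = speaker_embeddings[0]
    support = speaker_embeddings[1:]
    dim = len(query)
    # Centroid = element-wise mean of the remaining M - 1 utterances.
    prototype = [sum(e[d] for e in support) / len(support) for d in range(dim)]
    return query, prototype


# Toy batch shaped [speakers][M][dim] with 2 speakers, M = 3, dim = 2.
batch = [
    [[1.0, 0.0], [0.0, 2.0], [2.0, 4.0]],  # speaker A
    [[0.0, 1.0], [4.0, 0.0], [0.0, 0.0]],  # speaker B
]
for utterances in batch:
    query, prototype = split_query_and_prototype(utterances)
    print(query, prototype)
```

With M = 2 the support set holds a single utterance, so the prototype is simply that utterance's embedding.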

Questions about the test list

Hello, I would like to ask a few questions about the test list

  1. What does 1 or 0 in the first column of the test list mean?
  2. There are two voice tags on each line in the test list. Do I have to set this up?
  3. If I use my own data set, do I need to arrange it according to the training list and test list?
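For reference, a trial line of the form "label wav1 wav2" can be read as in the sketch below. It assumes that label 1 marks a same-speaker pair and 0 a different-speaker pair. The `parse_trial` helper is hypothetical and the example lines are illustrative only.

```python
def parse_trial(line):
    """Split one test-list line into (label, first_wav, second_wav)."""
    label, wav1, wav2 = line.split()
    return int(label), wav1, wav2


trials = [
    "1 id10270/x6uYqmx31kE/00001.wav id10270/8jEAjG6SegY/00008.wav",
    "0 id10270/x6uYqmx31kE/00001.wav id10300/ize_eiCFEg0/00003.wav",
]
for line in trials:
    label, wav1, wav2 = parse_trial(line)
    same_speaker = wav1.split("/")[0] == wav2.split("/")[0]
    print(label, same_speaker)
```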

train_list file not found

Hi,

I just finished the data preparation steps and am about to start training the model.

The readme shows we can train the model by:

python ./trainSpeakerNet.py --model ResNetSE34 --encoder SAP --trainfunc amsoftmax --optimizer adam --save_path data/exp1 --batch_size 200 --max_frames 200 --scale 30 --margin 0.3 --train_list /home/joon/voxceleb/train_list.txt --test_list /home/joon/voxceleb/test_list.txt --train_path /home/joon/voxceleb/voxceleb2 --test_path /home/joon/voxceleb/voxceleb1

However, I cannot find test_list.txt and train_list.txt in the data folder. Should we generate these two files ourselves, or did I do something wrong when processing the data?

Really appreciate your help.

Missing Details in README and code

  1. In the paper you mention:

For each epoch, we sample a maximum of 100 utterances from each of the 5,994 identities

In the code, that propagates as the value of the nSpeakers argument for angular prototypical. To reproduce the results shared in the paper, what value of nSpeakers is used for non-softmax losses?

  2. Both prototypical and angular prototypical have a hyperparameter M. How do I set that parameter in the code? (For other losses there is an option to pass the values m and s as arguments.)

  3. This relates to the evaluation approach and does not require correction: you used the Euclidean norm for similarity. Much earlier work on speaker verification, including VoxCeleb, used cosine similarity. I evaluated with cosine similarity and the EER comes to 2.2693, slightly higher but not by much. Is there any reason to use the Euclidean norm over cosine similarity?

Why using different input features

Hello, I just want to ask why different input features are used for different networks (Mel Spectrogram for VGG, Spectrogram for ResNet). And which is better?
I have tried using Spectrogram with VGG, and in my experiment there is not much difference in final performance compared to Mel Spectrogram.

Can't replicate the results

Hi! I trained the model using the parameters provided and got
IT 500, LR 0.000081, TEER/T1 66.35, TLOSS 1.446545, VEER 2.4920
with the best one on:
IT 480, LR 0.000090, TEER/T1 66.39, TLOSS 1.444428, VEER 2.4390

I used the following command:
python ./trainSpeakerNet.py --model ResNetSE34L --encoder SAP --trainfunc angleproto --optimizer adam --save_path data/exp1 --batch_size 800 --max_frames 200

And using the provided meta_files for training (5994 speakers) and testing.

Am I doing something wrong?

Fast ResNet-34?

Hi, I didn't find Fast ResNet-34 in the paper. Comparing it with Thin ResNet-34 in the code, are the only differences between them the input features (Spectrogram for Thin ResNet-34, MelSpectrogram for Fast ResNet-34) and the lack of a maxpool layer in Fast ResNet-34?

Consistent metric for training and testing

Hi,

Thank you for great work.

I think the implementation has a small flaw. Would it be more consistent to use the same metric for training and testing? The current implementation uses Euclidean distance regardless of the metric used in training.

dist = F.pairwise_distance(ref_feat.unsqueeze(-1).expand(-1,-1,num_eval), com_feat.unsqueeze(-1).expand(-1,-1,num_eval).transpose(0,2)).detach().cpu().numpy();

Best,
Hwidong

Any suggestions on reproduce?

Thanks for your extensive experiments! I guess the key to a low EER is "training for 500 epochs". I believe it is difficult for most people to reproduce that due to limited computing resources.
Do you have any suggestions for those people? That would be extremely valuable!

about AMsoftmax

Hi, I use the same code, just with a different loss function (softmax vs AM-Softmax). But the loss decreases slowly and the accuracy increases slowly when I use amsoftmax. Have you ever encountered this problem?

Regarding 500 training epochs

Just curious: in most deep learning programs, I train the model for fewer than 100 epochs, or in most cases fewer than 50. What is the intuition behind training for 500 epochs?

Incorrect argument in readme.

In the data preparation script, the example command for download has --username instead of --user as coded in dataprep.py

Convergence

How come this model takes so many epochs to train? It is a bit slow to converge.

Comparison with your earlier paper

I need some clarity on current implementation.

  1. There are differences between the Thin ResNet in your earlier work and what is implemented here, from the number of blocks in each layer to the number of filters in the different blocks and layers. As a result the model size is greatly reduced: the parameter count drops from over 8M to less than 1.5M. Is there a paper that details Thin ResNet, and how do you go about selecting the number of blocks and the number of filters in each block?

  2. In the earlier paper you used VLAD and GhostVLAD layers, which are removed here. Do you find them unnecessary with the current implementation?

  3. This one is not related to the old paper, but I am a bit confused by the current implementation. TEER means different things for different types of losses. In the case of angleproto loss, when you get the optimized result (lowest value of VEER) shown in the paper, what value of TEER (which I think is accuracy) do you get?
    I am training model using only voxceleb1 training data and I am getting following result at 400 epochs:

    TEER 70.15, TLOSS 1.085345, VEER 1.8081

    I am surprised with this result as VEER here is already less than what is mentioned in paper, so something doesn't seem quite right here. Any suggestion?

a problem on finetune

Hi, I want to train a robust model with the original VoxCeleb2 data and augmented VoxCeleb2 data (I might add more data later). Could you give some advice on which is the better way to do it: training from scratch or fine-tuning the voxceleb_trainer baseline model? (I tried fine-tuning, but got worse results than the baseline model.)

MelSpectrum and stft

I noticed in ResNetSE34L, you are using MelSpectrum while in ResNet34SE, you are using stft. Did you achieve similar EER for both?

Experimental results

Hello, thank you very much for your extensive experiment. I have a question for you.
In the experimental results, GE2E, Prototypical and Angular Prototypical are each compared for M = 2, 3, 4 and 5. How do you set this M in the code?

how do I train angular-prototypical loss for M=2?

I tried nSpeakers=2 to make gsize_dict['angleproto'] == 2,
but lowest VEER is about 2.58% until the 400 Epoch with training batch size 400 (--nSpeakers=2, --batch_size=200)

how do I train angular-prototypical loss for M=2??
Thank you...

Regarding negative similarity score

When I use angleproto loss to train the model, I see negative cosine similarity values for some pairs with label 0. I haven't come across such cases with softmax or amsoftmax losses. Is there any explanation for why that might be happening?

Angular Prototypical and # of classes

Does it consider all classes(5994) in a mini-batch? I am confused of this because if a mini-batch size is only 800, then maximum number of classes is just 800?

don't we need tuneThreshold.py?

If I only need min DCF and no longer need EER,
do I still need tuneThreshold.py?
My guess is that tuneThreshold is only needed for calculating EER...

Thank you.
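For context, the equal error rate is the operating point where the false acceptance rate and the false rejection rate coincide. The sketch below is a simple threshold sweep under that definition. It is not the repository's tuneThreshold.py, and the `equal_error_rate` helper and the toy scores are hypothetical.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping every observed score as a threshold."""
    positives = [s for s, l in zip(scores, labels) if l == 1]
    negatives = [s for s, l in zip(scores, labels) if l == 0]
    best_gap, eer = None, None
    for threshold in sorted(set(scores)):
        # False acceptance: a different-speaker pair scored at or above threshold.
        far = sum(s >= threshold for s in negatives) / len(negatives)
        # False rejection: a same-speaker pair scored below threshold.
        frr = sum(s < threshold for s in positives) / len(positives)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
labels = [1, 1, 1, 0, 0, 0]
print(round(equal_error_rate(scores, labels), 4))  # 0.3333
```

Only the ranking of the scores matters here, so the same sweep works for negative distances as well as for similarities.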

Regarding SAP implementation

@joonson Could you provide a reference (paper/GitHub) that the SAP implementation is based on? I am having a hard time understanding what is being done.

One question is why a non-linearity is used for this projection:
torch.tanh(self.sap_linear(x))

Also, does [self.attention] get trained as part of training?

I made changes to convert it from the single head implemented here to multi-head. The code works and I can train the model, but I am not seeing any improvement. A better understanding would help. Thanks.

Overfitting using VGG?

Hello, I trained with the VGG40 network; the results and hyperparameter settings are in scores.txt.
The EER and TEER (which I think is accuracy) after 10 epochs are 10.67% and 71.71% respectively.
The TEER becomes higher as the epochs increase, but the EER is 12.02% after 20 epochs and 12.58% after 30 epochs.
Does this mean the model is overfitting?
scores.txt

AttributeError: module 'torch' has no attribute 'finfo'

Once I use "ResnetSE34L", it shows:
File "/home/user/anaconda3/lib/python3.7/site-packages/torchaudio/compliance/kaldi.py", line 23, in <module>
EPSILON = torch.tensor(torch.finfo(torch.float).eps, dtype=torch.get_default_dtype())
AttributeError: module 'torch' has no attribute 'finfo'

Sorry, I have two problems

@joonson Hi, I have two problems, described below:

  1. Do you run it on multiple GPUs? I did, but I get a segmentation fault. I can run it on a single GPU, but I think it is too slow.
  2. gsize_dict = {'proto':args.nSpeakers, 'triplet':2, 'contrastive':2, 'softmax':1, 'amsoftmax':1, 'aamsoftmax':1, 'ge2e':args.nSpeakers, 'angleproto':args.nSpeakers}.
    In the paper,
    for 'proto'/'angleproto', nSpeakers = 2 gives the best result;
    for 'ge2e', nSpeakers = 3 gives the best result.
    Maybe the best nSpeakers changes with different datasets?

Comparing to your previous paper(VoxCeleb2)

In your previous VoxCeleb2 paper, the best EER on VoxCeleb1 is 2.87%, from a model that applies NetVLAD/GhostVLAD. But here you apply SAP. Ignoring Angular Prototypical loss for now, you also get a good result (2.36%) using CosFace/ArcFace, which is also better than 2.87%. What do you think is the factor that leads to this better result: SAP, or training for 500 epochs? The title of the paper is "In defence of metric learning", so I guess you want to emphasise the best result given by Angular Prototypical loss, but CosFace/ArcFace also achieves a close result (2.36%).

issues with using triplet as loss function to train the model

Hi,

I am trying to train a model using below command, but the program seems to be stuck. I am new in DNN, could you please give me some hint about the parameters. I fetched audio file info of only the first 100 speakers from the train.list and test.list and arranged them as train_list_100speakers.txt and
veri_test_100speakers.txt down below.

From the paper, I thought I should give --nSpeakers the value 2, and give --batch_size 100 if I would like to achieve batch size as 200 in the paper.

Below is the training command I used:

python ./trainSpeakerNet.py --model ResNetSE34L --encoder SAP --trainfunc triplet --optimizer adam --save_path data/exp3 --nSpeakers 2 --batch_size 100 --max_frames 200 --scale 30 --margin 0.3 --train_list ../voxceleb_data/train_list_100speakers.txt --test_list ../voxceleb_data/veri_test_100speakers.txt --train_path ../voxceleb_data/voxceleb2 --test_path ../voxceleb_data/voxceleb1 > nohup_triplet.txt 2>&1

However, I got errors like below a lot.

2020-05-28 11:52:45 24 Training ResNetSE34L with LR 0.000902...
Processing (400/490) Loss 0.247639 EER/T1 51.500% - 350.12 Hz Q:(0/10)
2020-05-28 11:52:47 LR 0.000902, TEER 51.50, TLOSS 0.247639

2020-05-28 11:52:47 25 Training ResNetSE34L with LR 0.000902...
Processing (400/499) Loss 0.259803 EER/T1 55.000% - 349.58 Hz Q:(0/10)Exception in thread Thread-10:
Traceback (most recent call last):
  File "/usr/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "/external/voxceleb/voxceleb_trainer/DatasetLoader.py", line 103, in dataLoaderThread
    feat.append(loadWAV(self.data_list[ij][ii], self.max_frames, evalmode=False));
IndexError: list index out of range

Could you please let me know where the index error comes from? I guess it is related to the --batch_size setting, but I don't have a clue how to solve it.

Thanks in advance
