jungjee / rawnet

Official repository for RawNet, RawNet2, and RawNet3

License: MIT License

Python 99.51% Shell 0.49%
speaker-embeddings speaker-verification pytorch voxceleb2 rawnet extracted-speaker-embeddings spk-embd

rawnet's Introduction

Overview

This repository includes implementations of speaker verification systems that input raw waveforms.

Currently, it includes four systems in Python. Detailed instructions for each system are provided in individual README files.

RawNet3 in ESPnet

As part of the open-source ESPnet-SPK project, a pre-trained RawNet3 model trained with the ESPnet-SPK framework is available for easy access. Although the architecture is the same, the enhanced framework has slightly improved performance further.

  • Performance
    • Vox1-O: EER 0.73%

Usage

As shown in Figure 3 of the ESPnet-SPK paper, the few lines of code below are sufficient to extract RawNet3 embeddings. Refer to the snippet and replace the np.zeros placeholder with your raw waveform.

  • ESPnet installation is a prerequisite
import numpy as np
from espnet2.bin.spk_inference import Speech2Embedding

# download and load the pre-trained RawNet3 model
speech2spk_embed = Speech2Embedding.from_pretrained(model_tag="espnet/voxcelebs12_rawnet3")

# replace np.zeros with your raw waveform (1-D float array, 16 kHz)
embedding = speech2spk_embed(np.zeros(16500))
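
To run the same extraction on real audio, read the waveform as a 1-D float array first. A minimal sketch, assuming the soundfile package is installed and sample.wav is a hypothetical 16 kHz mono file:

import soundfile as sf
from espnet2.bin.spk_inference import Speech2Embedding

# sample.wav is a hypothetical placeholder for your own 16 kHz mono file
wav, sr = sf.read("sample.wav")
assert sr == 16000, "the pre-trained model expects 16 kHz audio"

speech2spk_embed = Speech2Embedding.from_pretrained(model_tag="espnet/voxcelebs12_rawnet3")
embedding = speech2spk_embed(wav)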

ESPnet-SPK is currently on arXiv.

@article{jung2024espnet,
  title={ESPnet-SPK: full pipeline speaker embedding toolkit with reproducible recipes, self-supervised front-ends, and off-the-shelf models},
  author={Jung, Jee-weon and Zhang, Wangyou and Shi, Jiatong and Aldeneh, Zakaria and Higuchi, Takuya and Theobald, Barry-John and Abdelaziz, Ahmed Hussen and Watanabe, Shinji},
  journal={arXiv preprint arXiv:2401.17230},
  year={2024}
}

RawNet3

  • PyTorch implementation
  • Performance
    • supervised learning with AAM-Softmax: EER 0.89%
    • self-supervised learning: EER 5.40%
  • Training recipe
  • Inference
    • Pre-trained weight parameters are stored on HuggingFace and included as a submodule.
    • The Vox1-O benchmark is available in the RawNet3 directory.
    • Extracting a speaker embedding from any 16 kHz, 16-bit mono utterance is supported.
  • Published as a conference paper at Interspeech 2022.
@article{jung2022pushing,
  title={Pushing the limits of raw waveform speaker recognition},
  author={Jung, Jee-weon and Kim, You Jin and Heo, Hee-Soo and Lee, Bong-Jin and Kwon, Youngki and Chung, Joon Son},
  journal={Proc. Interspeech},
  year={2022}
}
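
Once embeddings are extracted, verification trials are typically scored with cosine similarity. A minimal sketch in pure numpy (the 256-dimensional dummy vectors and the 0.4 threshold are illustrative, not values from this repository):

import numpy as np

def cosine_score(a, b):
    # cosine similarity between two speaker embeddings
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# stand-ins for two extracted embeddings
emb1 = np.random.randn(256)
emb2 = np.random.randn(256)

score = cosine_score(emb1, emb2)
same_speaker = score > 0.4  # decision threshold must be tuned on a dev set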

RawNet2_modified

  • Code refactoring
  • Performance
    • EER 1.91%
      • Trained using VoxCeleb2
      • VoxCeleb1 original trial
    • Will be used as a baseline system for the authors' future work

RawNet2

@article{jung2020improved,
  title={Improved RawNet with Feature Map Scaling for Text-independent Speaker Verification using Raw Waveforms},
  author={Jung, Jee-weon and Kim, Seung-bin and Shim, Hye-jin and Kim, Ju-ho and Yu, Ha-Jin},
  journal={Proc. Interspeech},
  pages={3583--3587},
  year={2020}
}

RawNet

@article{jung2019RawNet,
  title={RawNet: Advanced end-to-end deep neural network using raw waveforms for text-independent speaker verification},
  author={Jung, Jee-weon and Heo, Hee-soo and Kim, Ju-ho and Shim, Hye-jin and Yu, Ha-Jin},
  journal={Proc. Interspeech},
  pages={1268--1272},
  year={2019}
}

rawnet's People

Contributors

jungjee, kimho1wq, polestvr


rawnet's Issues

Document the DB directory structure

For people who don't want to use VoxCeleb + VoxCeleb2, it is hard to figure out what the directory structure for DB should be. Could you please document it?

Or even better, if there were a simple-to-download audio dataset (e.g., from torchaudio) that the script would lay out in the right way, people could immediately try your repo and see if it works on their GPU.
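
For what it's worth, torchaudio can fetch a small corpus automatically; how its files would need to be rearranged into this repo's DB layout is exactly the open question above. A sketch of the download step, assuming torchaudio is installed:

import torchaudio

# download the LibriSpeech dev-clean subset into ./data
dataset = torchaudio.datasets.LIBRISPEECH("./data", url="dev-clean", download=True)

# each item yields the waveform plus metadata
waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = dataset[0]
print(waveform.shape, sample_rate, speaker_id)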

Too long IO time

I tried to train RawNet2 on the VoxCeleb2 dataset with default settings, but found that one epoch takes about 2.5 hours on average. By observing GPU activity, I found that the GPU spends most of its time waiting for data IO.
To reduce the GPU waiting time, I tried increasing the "num_workers" and "prefetch_factor" arguments of the PyTorch DataLoader, but the data loading time did not decrease.
My hardware: 3x RTX 3090, 128 GB memory, HDD.

  1. I wonder if you used an SSD when training the network? And how long does one epoch take when you train RawNet2?
  2. Do you have any advice on decreasing IO time?
    Thanks.
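
For reference, the DataLoader knobs usually tuned for IO-bound training look like the sketch below (the dummy dataset and all values are illustrative; with an HDD, random-access reads are often the real bottleneck, so these flags may not help much):

import torch
from torch.utils.data import DataLoader, TensorDataset

# dummy stand-in for a raw-waveform Dataset (lengths are illustrative)
dataset = TensorDataset(torch.randn(1000, 59049), torch.randint(0, 10, (1000,)))

loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=8,            # parallel loader processes
    prefetch_factor=4,        # batches pre-fetched per worker
    pin_memory=True,          # speeds up host-to-GPU copies
    persistent_workers=True,  # keep workers alive between epochs
)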

Missing .txt trial files

Hi, I would like to use your RawNet code for training and am testing it out on the VoxCeleb dataset first. However, for VoxCeleb1 I do not see the veri_test.txt file. For VoxCeleb2, I only see list_test_all_cleaned.txt and list_test_hard_cleaned.txt instead of the six .txt files in the file tree.

Could I check whether this is expected, or are there missing files?

RawNet_weights.h5

I'm wondering which network RawNet_weights.h5 provides weights for. Would you be able to provide the best model weights for both nets?

Misbehaving losses while training RawNet1

Hey @Jungjee, I was trying to fine-tune RawNet1 and observed that the center loss does not decrease as training proceeds, although the total loss does go down. I tried increasing the weight of the center loss but still see similar patterns. If the weight is too high, the spk-basis-loss starts to increase, so it seems these two losses are somewhat inversely related. Did you observe similar trends? I have attached plots of the losses for different values of the c_lambda parameter (the weight of the center loss). Is there something that could be going wrong here? In all cases, the learning rate was set to 0.001.

[Attached plots: loss curves for c_lambda = 0.001 (default), 0.1, 0.5, and 5]

Error in PreEmphasis Class

I am getting the following error in the PreEmphasis class:

RawNet3/RawNetBasicBlock.py", line 20, in forward
len(input.size()) == 2
AttributeError: 'builtin_function_or_method' object has no attribute 'size'

Could you please help?
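
For context, the assertion at that line checks for a 2-D tensor, and the message indicates that Python's builtin input function (not a tensor) reached forward. A minimal sketch of the expected call shape, assuming PreEmphasis can be constructed without arguments:

import torch
from RawNetBasicBlock import PreEmphasis  # from the RawNet3 directory

x = torch.randn(2, 16000)  # (batch, samples) float tensor
y = PreEmphasis()(x)
print(y.shape)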

Unable to train RawNet1 using Keras

I am facing this error when trying to run the 01-trn_RawNet.py script.

Traceback (most recent call last):
  File "01-trn_RawNet.py", line 279, in <module>
    loss, loss1, loss2, acc1, acc2 = model.train_on_batch([x, y], [dummy_y, dummy_y])
  File "/root/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 918, in train_on_batch
    outputs = self.train_function(ins)  # pylint: disable=not-callable
  File "/root/.local/lib/python3.6/site-packages/tensorflow/python/keras/backend.py", line 3510, in __call__
    outputs = self._graph_fn(*converted_inputs)
  File "/root/.local/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 572, in __call__
    return self._call_flat(args)
  File "/root/.local/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 671, in _call_flat
    outputs = self._inference_function.call(ctx, args)
  File "/root/.local/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 445, in call
    ctx=ctx)
  File "/root/.local/lib/python3.6/site-packages/tensorflow/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError:  Cannot update variable with shape [] using a Tensor with shape [4], shapes must be equal.
         [[node metrics/s_bs_loss_accuracy/AssignAddVariableOp (defined at 01-trn_RawNet.py:279) ]] [Op:__inference_keras_scratch_graph_16089]

Function call stack:
keras_scratch_graph

Can you please help me out? I am using tensorflow 2.0.0-beta1 with a batch size of 4.

Config file

It would be of great help if you could provide the config YAML file that you used to train the PyTorch model.

can you share code to load the pretrained model?

I am having difficulties loading the pretrained model. I tried both model_RawNet2_original_code and model_RawNet2.

from model_RawNet2 import RawNet2
from parser import get_args
import sys
import torch

sys.argv = ['RawNet-Pytorch.ipynb'] + ['-name'] + ['Rawnet']
args = get_args()
args.model['nb_classes'] = 6112

model = RawNet2(args.model)
model.load_state_dict(torch.load('./Pre-trained_model/rawnet2_best_weights.pt'))
model.eval()
RuntimeError: Error(s) in loading state_dict for RawNet2:
	Missing key(s) in state_dict: "block0.0.frm.fc.weight", "block0.0.frm.fc.bias", "block1.0.frm.fc.weight", "block1.0.frm.fc.bias", "block2.0.frm.fc.weight", "block2.0.frm.fc.bias", "block3.0.frm.fc.weight", "block3.0.frm.fc.bias", "block4.0.frm.fc.weight", "block4.0.frm.fc.bias", "block5.0.frm.fc.weight", "block5.0.frm.fc.bias". 
	Unexpected key(s) in state_dict: "fc_attention0.0.weight", "fc_attention0.0.bias", "fc_attention1.0.weight", "fc_attention1.0.bias", "fc_attention2.0.weight", "fc_attention2.0.bias", "fc_attention3.0.weight", "fc_attention3.0.bias", "fc_attention4.0.weight", "fc_attention4.0.bias", "fc_attention5.0.weight", "fc_attention5.0.bias". 

code with model_RawNet2_original_code

from model_RawNet2_original_code import RawNet
model2 = RawNet(args.model, 'gpu')
model2.load_state_dict(torch.load('./Pre-trained_model/rawnet2_best_weights.pt'))
model2.eval()

RuntimeError: Error(s) in loading state_dict for RawNet:
	Unexpected key(s) in state_dict: "block2.0.conv_downsample.weight", "block2.0.conv_downsample.bias". 
	size mismatch for block2.0.bn1.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for block2.0.bn1.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for block2.0.bn1.running_mean: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for block2.0.bn1.running_var: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for block2.0.conv1.weight: copying a param with shape torch.Size([256, 128, 3]) from checkpoint, the shape in current model is torch.Size([256, 256, 3]).

Can you please share reproducible code for loading the model?
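
One way to diagnose such mismatches is to compare the checkpoint keys against the model's keys before calling load_state_dict. A short sketch (model here is whichever RawNet2 variant was constructed above):

import torch

ckpt = torch.load("./Pre-trained_model/rawnet2_best_weights.pt", map_location="cpu")
model_keys = set(model.state_dict().keys())
ckpt_keys = set(ckpt.keys())

print("in model but not checkpoint:", sorted(model_keys - ckpt_keys)[:5])
print("in checkpoint but not model:", sorted(ckpt_keys - model_keys)[:5])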

centre loss

Where is the center loss component? I can't find it.

Overfitting on VoxCeleb

Hi.
I tried the pre-trained RawNet on VoxCeleb and it works better than ECAPA-TDNN pre-trained on VoxCeleb. But when I switch to another dataset in English (without training) or in other languages, it works worse than ECAPA-TDNN, usually by a factor of 2-4 in EER. Do you have any idea why this is happening? Thanks!

how to evaluate your implementation with a different dataset

Hi!

I'm trying to evaluate your implementation on a different dataset (therefore with a different test_path and test_list). What is the correct way to do so without modifying the default paths in trainSpeakerNet.py?
I mean, what should the command to evaluate your implementation look like? Something like this? -->
python ./trainSpeakerNet.py --test_path /path_to_test --test_list /path_to_the_list

Sorry for my ignorance, I'm new to programming in Python and to neural networks.
Looking forward to your response,
Thank you!

how to use it for speaker verification

Hi JungJee,

After I trained the models, I want to use them for speaker verification. I have a test set with, say, N speakers, each with M enrollment utterances and L test utterances.
Should I enroll the N speakers using the M utterances, and then for each speaker's L test utterances calculate scores against the N enrolled speakers and select the speaker whose score is highest?

Thanks,
Willy
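
That protocol can be sketched directly: average the M enrollment embeddings per speaker, then assign each test utterance to the highest-scoring enrolled speaker. In the sketch below, embed() is a hypothetical stub standing in for the trained model's embedding extractor, and the toy data is illustrative:

import numpy as np

def embed(utt):
    # hypothetical stub: replace with the trained model's embedding extractor
    rng = np.random.default_rng(abs(hash(utt)) % (2**32))
    return rng.standard_normal(256)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy enrollment map: speaker -> list of M utterance ids
enrollment = {"spk1": ["s1_u1.wav", "s1_u2.wav"], "spk2": ["s2_u1.wav", "s2_u2.wav"]}

# one averaged embedding per enrolled speaker
enrolled = {spk: np.mean([embed(u) for u in utts], axis=0)
            for spk, utts in enrollment.items()}

def identify(test_utt):
    # pick the enrolled speaker with the highest cosine score
    e = embed(test_utt)
    return max(enrolled, key=lambda spk: cosine(e, enrolled[spk]))

print(identify("s1_u3.wav"))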

Pytorch scripts?

Any timeline on when the scripts will be uploaded?

Thank you!

What is the requirement in terms of hardware?

First of all thanks for such amazing work of RawNet and its variants!

I just want to know about the hardware requirements for training RawNet2 and modified RawNet2. I have the following two questions:

  1. I tried to run it on a single GPU with 12 GB memory and failed. What was your experimental hardware setting?
  2. Is it recommended to reduce the mini-batch size for better memory handling?

Directory tree of files

Can you please show in the README the directory tree of the VoxCeleb1 files that were used in your experiment? I'm a bit confused when looking at 00-pre_process_waveforms.py.

Can I feed the 22050 sr wav to the pre-trained rawnet3 model ?

Hi,
I want to use the RawNet3 model in my project to compute the speaker similarity of pairs of wavs. All the audio in my dataset is 22050 Hz; for some reason I could downsample that audio to 16000 Hz. I wonder whether the pre-trained model is suitable for 22050 Hz audio.
Thanks
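
Since the pre-trained model was built for 16 kHz audio, resampling 22050 Hz input first is the safe route. A minimal sketch with scipy's polyphase resampler (16000/22050 reduces to 320/441; my_audio.wav is a hypothetical path):

import soundfile as sf
from scipy.signal import resample_poly

wav, sr = sf.read("my_audio.wav")  # hypothetical 22050 Hz mono file
assert sr == 22050
wav_16k = resample_poly(wav, up=320, down=441)  # 22050 * 320 / 441 = 16000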

RawNet2 PyTorch weights & inference

I have a couple of questions regarding RawNet2 usage for inference:

  1. By "pre-trained model" in README you meant a model fully trained on VoxCeleb2? Or just the model after the pre-training phase on speaker identification task?
  2. There are two implementations of RawNet2 - one in model_RawNet2.py file, one in model_RawNet2_original_code.py. The "pre-trained model"'s weights are for the latter. What are the differences between them and which one should I use for inference (on other datasets, not necesserily VoxCeleb1)
  3. Are you planning on releasing the weights of the model fully trained on VoxCeleb2? I would like to experiment with it on other datasets, but, unfortunatel, don't have the ability to train it myself.

how to create the test_list for a new test dataset

Hi Jungjee,

Thanks for sharing your great work!
Could you please share the code you used to create the test_list for a new test dataset? For example, if I want to test using the TIMIT corpus.

head -5 vox1_veri_test2.txt

1 id10270/x6uYqmx31kE/00001.wav id10270/8jEAjG6SegY/00008.wav
0 id10270/x6uYqmx31kE/00001.wav id10300/ize_eiCFEg0/00003.wav
1 id10270/x6uYqmx31kE/00001.wav id10270/GWXujl-xAVM/00017.wav
0 id10270/x6uYqmx31kE/00001.wav id10273/0OCW1HUxZyg/00001.wav
1 id10270/x6uYqmx31kE/00001.wav id10270/8jEAjG6SegY/00022.wav

Here is what I understand: column 1 is 1 if column 2 and column 3 are from the same speaker, and 0 otherwise.
For a large corpus like VoxCeleb, how should column 2 and column 3 be selected?

Your help will be greatly appreciated!

Regards,
Willy
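
In practice, such trial lists are built by sampling pairs at random: positives pair two utterances of the same speaker, negatives pair utterances of different speakers. A minimal sketch, where utts is an assumed mapping from speaker id to utterance paths:

import random

def make_trials(utts, n_pairs):
    # utts: dict of speaker id -> list of utterance paths (assumed input)
    trials, speakers = [], list(utts)
    for _ in range(n_pairs):
        spk = random.choice(speakers)
        if random.random() < 0.5 and len(utts[spk]) >= 2:
            a, b = random.sample(utts[spk], 2)
            trials.append((1, a, b))  # target trial (same speaker)
        else:
            other = random.choice([s for s in speakers if s != spk])
            trials.append((0, random.choice(utts[spk]), random.choice(utts[other])))
    return trials

utts = {"id1": ["id1/a.wav", "id1/b.wav"], "id2": ["id2/a.wav", "id2/b.wav"]}
for label, utt_a, utt_b in make_trials(utts, 4):
    print(label, utt_a, utt_b)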

Missing models for RawNet

from model_RawNet_pre_train import get_model as get_model_pretrn
from model_RawNet import get_model

Hello.

I've been trying to run the reproduction of the system from the original RawNet paper and found that these two modules are missing. Could you please upload them?

About the pretrained model

Hi, when I test the model using the pre-trained weights rawnet2_best_weights.pt, it throws the error
Missing key(s) in state_dict: "block0.0.frm.fc.weight", "block0.0.frm.fc.bias", "block1.0.frm.fc.weight", "block1.0.frm.fc.bias", "block2.0.frm.fc.weight", "block2.0.frm.fc.bias", "block3.0.frm.fc.weight", "block3.0.frm.fc.bias", "block4.0.frm.fc.weight", "block4.0.frm.fc.bias", "block5.0.frm.fc.weight", "block5.0.frm.fc.bias"
These parameters really do seem to be missing; I checked the model with Netron.
Many thanks in advance for your reply.

The generalization ability

Is the generalization of RawNet2 poor?
I trained RawNet2 on the AISHELL dataset with 340 speakers and tested on a trial.txt with 80k pairs built from another 40 AISHELL speakers; the final EER is 3.46%. But when testing on 40 speakers of the VCTK dataset with 80k pairs, the EER is 32.71%. Do you know why? Thanks.
