bjfu-ai-institute / speaker-recognition-papers

Share some recent speaker recognition papers and their implementations.

Topics: tensorflow, paper-implementations, speaker-recognition, speaker-verification

speaker-recognition-papers's Introduction

Introduction

These are slightly modified TensorFlow/Python implementations of recent speaker recognition papers. Please let me know if anything here infringes copyright and I will remove the affected papers as soon as possible. Our license applies only to our code; the papers themselves are not covered. Thanks.

The file structure is as follows:

|———pyasv
|
|—————model (folder, contains the models)
|
|—————loss (folder, contains the customized loss functions)
|
|—————papers (folder, contains the original papers for most of the methods)
|
|—————backend (TODO: folder, will contain the back-end methods)
|
|—————data_manage.py (contains methods to manage data)
|
|—————speech_processing.py (contains methods to extract features and process audio)
|
|—————config.py (settings, e.g. save path, learning rate)

More info: Doc

If you want to run this code on your own machine, you only need to write something like this:

from pyasv import Config
from pyasv.speech_processing import ext_mfcc_feature
from pyasv.data_manage import DataManage
from pyasv.model.ctdnn import run

config = Config(name='my_ctdnn_model',
                n_speaker=1000,
                batch_size=64,
                n_gpu=2,
                max_step=100,
                is_big_dataset=False,
                url_of_bigdataset_temp_file=None,
                learning_rate=1e-3,
                slide_windows=[4, 4],
                save_path='/home/my_path')
config.save('./my_config_path')

# extract MFCC features and build the training set
frames, labels = ext_mfcc_feature('data_set_path', config)
train = DataManage(frames, labels, config)

# repeat for the validation set
frames, labels = ext_mfcc_feature('data_set_path', config)
validation = DataManage(frames, labels, config)

run(config, train, validation)

TODO

  • Implement papers from ICASSP 2018 & Interspeech 2018.
  • Compare all models on the same dataset.

Implemented papers:

  • L. Li, Z. Tang, D. Wang, T. Zheng, "Deep Speaker Feature Learning for Text-Independent Speaker Verification."
  • L. Li, Z. Tang, D. Wang, T. Zheng, "Full-info Training for Deep Speaker Feature Learning," ICASSP 2018.
  • C. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kannan, Z. Zhu, "Deep Speaker: an End-to-End Neural Speaker Embedding System."
  • S. Novoselov, O. Kudashev, V. Shchemelinin, I. Kremnev, G. Lavrentyeva, "Deep CNN Based Feature Extractor for Text-Prompted Speaker Recognition."

speaker-recognition-papers's People

Contributors

gamesterrishi, vzxxbacq

speaker-recognition-papers's Issues

Low validation accuracy while training for 50 speakers

Hi,

First of all, thank you for sharing your implementation of the CTDNN model for ASV. I have been trying to use your code to train a 50-speaker model. However, I am unable to get the validation accuracy above 2.7% no matter what. I have tried various kinds of parameter tuning and loss and optimizer customizations, but to no effect. I have been trying to replicate the inputs described in the CTDNN paper, with a sliding window of size 9 and 40 f-bank dimensions. I am calculating validation accuracy using Jaccard similarity. The loss doesn't decrease even after training for a long time. It would be really great if you could share a pretrained model with me, or guide me in reproducing the results you got with your implementation.

CTDNN approach doubt

Hi Fang,

Thanks for implementing these research papers. It means a lot.

I would like to know what 9, 40, 1 represent in this line of code, and how I can use the MFCC feature extractor to create such an array.

tf.placeholder(tf.float32, shape=[None, 9, 40, 1], name='pred_x')

Thanks
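
Not an official answer from the authors, but here is a minimal sketch of how an array with shape [None, 9, 40, 1] could be assembled, assuming the 9 is a sliding context window of frames, the 40 is the per-frame MFCC/f-bank dimension, and the 1 is a channel axis for the convolution (consistent with the "sliding window of size 9 and 40 f-bank dimensions" mentioned in the issue above). The generic feature code below is not taken from pyasv:

import numpy as np

def frames_to_windows(features, window=9):
    # Stack a (T, 40) per-frame feature matrix into (T - window + 1, 9, 40, 1)
    # by sliding a 9-frame context window over time and adding a channel axis.
    windows = [features[t:t + window] for t in range(features.shape[0] - window + 1)]
    return np.asarray(windows, dtype=np.float32)[..., np.newaxis]

# Stand-in for real per-frame features: 200 frames of 40-dimensional MFCC/f-bank values.
features = np.random.random((200, 40)).astype(np.float32)
batch = frames_to_windows(features)
print(batch.shape)  # (192, 9, 40, 1), which matches shape=[None, 9, 40, 1]

Feeding such a batch via feed_dict={pred_x: batch} would then match the placeholder above.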

Validation Accuracy low for Deep Speaker Model

Hi @vzxxbacq

I have been trying to train my model for 18 speakers, but the validation accuracy is really low. I have tried various different architectures, but both the training and validation accuracy stay low and the model does not converge well. Can you please help me out and tell me what is going wrong?

Thank you

Question

May I ask why the config.py file in the pyasv program folder differs from the content of the config.html page given in the html folder?

Using DataManage4BigData class

Hello @vzxxbacq ,

I wanted to use the DataManage4BigData class but am stuck on a few queries. I have listed them below, so could you please help me with them?

  1. There is a 'split_type' parameter for the class; what should its value be? Should it be 'train' or 'validation', depending on the type of clips we are using?
  2. There is a write-file method in it which uses the extracted features, but since my dataset is huge it gives me a MemoryError once I extract features for the whole dataset. Is there anything I am doing wrong in the process of training with a large dataset? (See the sketch after this post.)

Thank you for your help! :)
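
Not an official answer, and the details of DataManage4BigData's API are not shown here, but a generic way to avoid the MemoryError is to extract and save features in fixed-size chunks rather than holding the whole dataset in memory. In the sketch below, extract_features is a hypothetical placeholder for a real per-clip extractor (for example, something built on pyasv.speech_processing):

import os
import numpy as np

def extract_features(wav_path):
    # Hypothetical placeholder: return a fixed-size feature matrix for one clip.
    raise NotImplementedError

def extract_in_chunks(wav_paths, labels, out_dir, chunk_size=500):
    # Extract features chunk by chunk and save each chunk to disk, so only
    # chunk_size clips are ever held in memory at once. Assumes every clip
    # yields a feature matrix of the same shape.
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    for i in range(0, len(wav_paths), chunk_size):
        feats = [extract_features(p) for p in wav_paths[i:i + chunk_size]]
        np.save(os.path.join(out_dir, 'feats_%06d.npy' % i), np.asarray(feats))
        np.save(os.path.join(out_dir, 'labels_%06d.npy' % i),
                np.asarray(labels[i:i + chunk_size]))

def iter_chunks(out_dir):
    # Lazily yield (features, labels) chunks for training.
    for name in sorted(f for f in os.listdir(out_dir) if f.startswith('feats_')):
        yield (np.load(os.path.join(out_dir, name)),
               np.load(os.path.join(out_dir, name.replace('feats_', 'labels_'))))

Whether this maps cleanly onto DataManage4BigData's write-file method is something the maintainers would have to confirm.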

Fighting Wolf

I would appreciate it if Fangshen could provide documentation or guidance in Chinese. Because I! Am! Chinese!

Invalid compatible shapes caused by op LogicalAnd in triplet_loss.py

Hi @vzxxbacq ,

Thanks a lot for sharing the work and code for Deep Speaker. I was trying to run the script deep_speaker.py and encountered an error. I have changed the batch size to 4 and am not using any GPU for training; the _main() function is as follows.

def _main():
    """
    Test model.
    """
    from pyasv.data_manage import DataManage
    from pyasv import Config
    import numpy as np
    import sys
    sys.path.append("../..")
    config = Config(name='deepspeaker', n_speaker=10, batch_size=4, n_gpu=0, max_step=20,
                    is_big_dataset=False, learning_rate=0.01, save_path='./dataset/save',
                    conv_weight_decay=0.01, fc_weight_decay=0.01, bn_epsilon=1e-3)
    x = np.random.random([120, 100, 64, 1])
    y = np.random.randint(0, 10, [120, 1])
    train = DataManage(x, y, config)

    x = np.random.random([64, 100, 64, 1])
    y = np.random.randint(0, 10, [64, 1])
    validation = DataManage(x, y, config)

    run(config, train, validation)

and the error I encountered is as follows:

  File "deep_speaker.py", line 512, in <module>
    _main()
  File "deep_speaker.py", line 508, in _main
    run(config, train, validation)
  File "deep_speaker.py", line 459, in run
    _no_gpu(config, train, validation)
  File "deep_speaker.py", line 247, in _no_gpu
    feed_dict={x: batch_x, y: batch_y})
  File "/Users/Desktop/speaker_verification/speaker_env/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/Users/Desktop/speaker_verification/speaker_env/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1128, in _run
    feed_dict_tensor, options, run_metadata)
  File "/Users/Desktop/speaker_verification/speaker_env/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1344, in _do_run
    options, run_metadata)
  File "/Users/Desktop/speaker_verification/speaker_env/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1363, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [4,4] vs. [4,4,10]
	 [[Node: LogicalAnd = LogicalAnd[_device="/job:localhost/replica:0/task:0/device:CPU:0"](LogicalNot, Equal_1)]]

Caused by op u'LogicalAnd', defined at:
  File "deep_speaker.py", line 512, in <module>
    _main()
  File "deep_speaker.py", line 508, in _main
    run(config, train, validation)
  File "deep_speaker.py", line 459, in run
    _no_gpu(config, train, validation)
  File "deep_speaker.py", line 223, in _no_gpu
    model = DeepSpeaker(config=config, x=x, y=y)
  File "deep_speaker.py", line 49, in __init__
    self._build_train_graph(x, y)
  File "deep_speaker.py", line 80, in _build_train_graph
    self._loss = self._triplet_loss(output, y)
  File "deep_speaker.py", line 142, in _triplet_loss
    loss = triplet_loss.batch_hard_triplet_loss(targets, inp, 1.0)
  File "/speaker-recognition-papers/pyasv/loss/triplet_loss.py", line 239, in batch_hard_triplet_loss
    mask_anchor_positive = _get_anchor_positive_triplet_mask(labels)
  File "/speaker-recognition-papers/pyasv/loss/triplet_loss.py", line 94, in _get_anchor_positive_triplet_mask
    mask = tf.logical_and(indices_not_equal, labels_equal)
  File "/Users/Desktop/speaker_verification/speaker_env/lib/python2.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 2401, in logical_and
    "LogicalAnd", x=x, y=y, name=name)
  File "/Users/Desktop/speaker_verification/speaker_env/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/Users/Desktop/speaker_verification/speaker_env/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3160, in create_op
    op_def=op_def)
  File "/Users/Desktop/speaker_verification/speaker_env/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1625, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Incompatible shapes: [4,4] vs. [4,4,10]

I tried even with different numbers of speakers, but it still gives me this error. Can you please let me know why it is giving this error?

Thanks a lot !
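
Not an official answer, but the traceback matches the widely used batch-hard triplet-loss mask construction, which expects a 1-D label vector of shape [batch_size]. A minimal sketch of that mask (an assumption about the recipe, not necessarily the exact code in pyasv/loss/triplet_loss.py):

import tensorflow as tf

def anchor_positive_mask(labels):
    # indices_not_equal has shape [batch, batch]. labels_equal only broadcasts to
    # [batch, batch] when labels is 1-D; if the labels arrive one-hot with shape
    # [batch, n_speaker], labels_equal becomes [batch, batch, n_speaker] and the
    # logical_and below fails with "Incompatible shapes: [4,4] vs. [4,4,10]".
    indices_equal = tf.cast(tf.eye(tf.shape(labels)[0]), tf.bool)
    indices_not_equal = tf.logical_not(indices_equal)
    labels_equal = tf.equal(tf.expand_dims(labels, 0), tf.expand_dims(labels, 1))
    return tf.logical_and(indices_not_equal, labels_equal)

Under that assumption, reducing the labels to shape [batch_size] before they reach the mask (for example tf.reshape(y, [-1]) for integer labels, or tf.argmax(y, axis=1) for one-hot labels) would make the shapes consistent.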
