Comments

Walleclipse commented on August 21, 2024

As the Deep Speaker paper says, the network takes audio as input and outputs an embedding vector, mapping utterances to a hypersphere where speaker similarity is measured by cosine similarity. The embeddings can be used for many tasks, including speaker identification, verification, and clustering. Generally, there are two different applications.

  1. Classification. You need to attach a softmax classification layer, just as in pretraining.py. This requires the training set and the test set to contain the same speakers; if the test set includes people who do not appear in the training set, they cannot be classified.
  2. Verification. Verify whether utterance A was spoken by speaker B. We obtain the embedding of utterance A through the network, then compute its similarity to speaker B's embedding (e.g., cosine similarity). We set a threshold: if the similarity is above it, we judge that utterance A was spoken by speaker B (a sketch follows this list). The training set and the test set can contain different speakers, because you only need to verify whether a given utterance was spoken by a given speaker.

I'm not sure what you need to do. For me, I used LibriSpeech to train the embedding, so that utterances from the same person have high embedding similarity and utterances from different people have low similarity; I only care whether two utterances come from the same person, so this does not require the training set and the test set to share speakers.
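A minimal sketch of the threshold-based verification step in point 2, assuming utterances have already been mapped to embedding vectors; `embed`, the file names, and the threshold value are placeholders, not this repository's actual API:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(utterance_emb, speaker_emb, threshold=0.8):
    # Accept the claim "this utterance was spoken by this speaker"
    # when the similarity clears the threshold. The 0.8 is purely
    # illustrative; in practice the threshold is tuned on held-out
    # trials (e.g., for a target equal error rate).
    return cosine_similarity(utterance_emb, speaker_emb) >= threshold

# Hypothetical usage: embed(...) stands in for a forward pass of the
# trained network and is not a function from this repository.
# emb_a = embed("voice_A.wav")
# emb_b = embed("speaker_B_enrollment.wav")
# print(verify(emb_a, emb_b))
```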


Walleclipse commented on August 21, 2024

Hello,
(1.1) In model.py, "convolutional_model" is the same as the CNN model in the paper. "convolutional_model_simple" is a simplified CNN model with fewer parameters; it uses less GPU memory and runs faster. In our experiments, however, the original CNN model performed better than the simplified one. If you want better performance, try a deeper network or other recent CNN variants, such as dilated CNNs.
(1.2) I have not tuned the GRU model carefully, and it performed worse than the CNN in my experiments. You could try changing its architecture.
(2) In silence_detector.py, "wav_fn" is the path of the speech file (.wav) in which you want to detect silence. Set "wav_fn" to your audio file's path and run the script, and it will output the time segments where silence occurs (a generic sketch follows).
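Since silence_detector.py itself is not shown in this thread, here is a generic energy-based sketch of what such a detector does, not the repository's actual implementation; the librosa dependency, function name, and dB threshold are all assumptions:

```python
import numpy as np
import librosa  # assumption: any audio loader would do

def detect_silence(wav_fn, frame_ms=25, hop_ms=10, db_floor=-40.0):
    """Return (start_sec, end_sec) segments whose frame energy falls
    below a dB threshold relative to the loudest frame. A generic
    sketch, not the logic inside silence_detector.py."""
    y, sr = librosa.load(wav_fn, sr=None)
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    # Frame-wise RMS energy, converted to dB relative to the maximum.
    rms = librosa.feature.rms(y=y, frame_length=frame, hop_length=hop)[0]
    db = 20 * np.log10(rms / (rms.max() + 1e-10) + 1e-10)
    silent = db < db_floor
    segments, start = [], None
    for i, is_silent in enumerate(silent):
        if is_silent and start is None:
            start = i
        elif not is_silent and start is not None:
            segments.append((start * hop / sr, i * hop / sr))
            start = None
    if start is not None:  # trailing silence runs to the end
        segments.append((start * hop / sr, len(silent) * hop / sr))
    return segments

# print(detect_silence("example.wav"))
```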


guanjian729 commented on August 21, 2024

Hello, may I ask another dataset question: in LibriSpeech, the directory level under train/test is the speaker ID, right? But the training set and the test set never contain utterances of the same speaker. For example, the test set may include several utterances of speaker 61 while the training set contains no speaker 61 at all; that is, SPEAKERS.TXT in LibriSpeech never lists the same speaker twice. So, for example, I am doing speaker identification with the GMM-UBM method: during training I generate a <speaker ID>.gmm for each training speaker, but at test time the matching speaker's GMM can never be found, so the recognition rate is always 0. How should I handle this, or am I misunderstanding the LibriSpeech dataset?
Thank you again for your help.
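As noted in the first comment, closed-set identification (which includes GMM-UBM) needs the same speakers at training and test time, so a common workaround is to split each speaker's own utterances into an enrollment part and a test part. A hypothetical sketch, assuming the standard LibriSpeech layout of speaker_id/chapter_id/*.flac:

```python
import glob
import os
import random

def per_speaker_split(librispeech_root, enroll_ratio=0.8, seed=0):
    """Split each speaker's utterances into enrollment and test lists,
    so every test speaker also has enrollment data. Assumes the
    LibriSpeech layout: root/speaker/chapter/*.flac."""
    rng = random.Random(seed)
    enroll, test = {}, {}
    for spk_dir in sorted(glob.glob(os.path.join(librispeech_root, "*"))):
        if not os.path.isdir(spk_dir):
            continue
        spk = os.path.basename(spk_dir)
        utts = sorted(glob.glob(os.path.join(spk_dir, "*", "*.flac")))
        rng.shuffle(utts)
        # Speakers need at least two utterances for a non-empty test share.
        cut = max(1, int(len(utts) * enroll_ratio))
        enroll[spk], test[spk] = utts[:cut], utts[cut:]
    return enroll, test

# enroll, test = per_speaker_split("LibriSpeech/train-clean-100")
```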


guanjian729 commented on August 21, 2024

Got it, thank you very much!
