
Comments (19)

vzxxbacq avatar vzxxbacq commented on June 12, 2024 2

Hi, @gamesterrishi
I have fixed this bug: I forgot to remove the first tf.nn.softmax. I had written the softmax myself at first. After removing it and adding tf.reduce_mean, the model works correctly. I'm sorry I was so careless. And feel free to open a pull request; I am grateful for your help.
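To see why the extra tf.nn.softmax was harmful, here is a small numpy sketch (the `softmax` helper is illustrative, not the repo's code): applying softmax to values that are already probabilities compresses them toward uniform, which weakens the gradients through the cross-entropy loss.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax
    e = np.exp(x - np.max(x))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
once = softmax(logits)
twice = softmax(once)  # softmax applied to probabilities

# the second softmax pushes the distribution much closer to uniform,
# so the loss surface flattens and learning slows or stalls
spread_once = once.max() - once.min()
spread_twice = twice.max() - twice.min()
```

This is presumably why the model only started training once the duplicated softmax was removed and the loss was averaged with tf.reduce_mean.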

from speaker-recognition-papers.

vzxxbacq avatar vzxxbacq commented on June 12, 2024 1

Hi,

Very sorry for the late reply. I'm very glad to hear my code is helpful to you. I got worse results than the paper, just like you. I use a 40GB dataset to train my model and 50 speakers to test it. The result is a bit worse than the i-vector approach.

Here are some suggestions:

1. We should use CMVN. (I forgot to push it to master, sorry.)
2. We can try different initialization approaches.
3. We can use regularization.

Actually I should have done all this already, but I got a summer internship; very sorry for that, I'll do these as soon as I can. I am not an expert in ASV, but I hope these are helpful. Hope we can keep in touch.
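The CMVN suggestion can be sketched as per-utterance mean and variance normalization of the f-bank features over the time axis. This is only a sketch; the `cmvn` helper, its shapes, and the epsilon are my assumptions, not the code that was later pushed to master.

```python
import numpy as np

def cmvn(feats, eps=1e-8):
    """Cepstral mean and variance normalization over the time axis.

    feats: [num_frames, num_filter_banks] array of f-bank features.
    """
    mean = feats.mean(axis=0, keepdims=True)
    std = feats.std(axis=0, keepdims=True)
    return (feats - mean) / (std + eps)

# toy features with a large offset and scale, as raw f-banks often have
feats = np.random.randn(100, 40) * 3.0 + 5.0
norm = cmvn(feats)  # zero mean, unit variance per filter bank
```

Normalizing per utterance removes channel and loudness offsets, which is typically why it helps speaker-verification front ends.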

vzxxbacq avatar vzxxbacq commented on June 12, 2024 1

Hi @gamesterrishi

I have restructured these models; they now support training without a GPU.

About your questions: I'm using a 40GB dataset (220K utterances, 2400 speakers), and in my opinion validation should mirror testing. So in this model we update the speaker embeddings every epoch, compute the speaker embeddings of the validation dataset, and then score them just as we do at test time. In fact, this operation is slow; you can comment it out if you don't need it.

Please contact me if you run into any trouble~

vzxxbacq avatar vzxxbacq commented on June 12, 2024 1

Hi, @gamesterrishi. I'm glad we can learn together. I'll implement the restore and predict methods this weekend. After you pointed out that the P-norm layer isn't a simple fc layer, I found its original paper. (link) I'll implement it after I read the paper. Thank you.

gamesterrishi avatar gamesterrishi commented on June 12, 2024

Hi,
Sorry for the late reply, and thanks for your answer to this question. You are very active in making changes to your code, and that's very helpful to me. Can you tell me what accuracy you are getting on your dataset, and which dataset you are using? I am using audio files scraped from LibriVox using Selenium. I have taken CD-quality audio files for 50 speakers, and I am creating f-banks for them with a sliding window of size 9 and 40 filter banks. I am creating a numpy array of shape [?, 9, 40, 1] for the train frames and a numpy array of shape [?, 50] for the train labels.
Can you also tell me how you are calculating the validation accuracy?
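For reference, the [?, 9, 40, 1] input described above can be built by sliding a 9-frame window over the f-bank matrix. The `stack_frames` helper below is a hypothetical sketch of that step, not code from either of our repos.

```python
import numpy as np

def stack_frames(fbanks, window=9):
    """Slide a `window`-frame context over [num_frames, num_banks]
    f-banks to build CNN inputs of shape [n, window, num_banks, 1]."""
    n = fbanks.shape[0] - window + 1
    windows = np.stack([fbanks[i:i + window] for i in range(n)])
    return windows[..., np.newaxis]  # add the trailing channel dim

fbanks = np.random.randn(200, 40)  # toy utterance: 200 frames, 40 banks
x = stack_frames(fbanks)
# x.shape == (192, 9, 40, 1)
```

Each window overlaps the next by 8 frames here; a stride parameter could thin that out if memory is tight.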

I am using sklearn's implementation of Jaccard similarity to calculate the validation accuracy on a split of the train frames themselves. Can you please have a look at the code and point out if something is wrong? The predictions variable is calculated using the inference method defined in your class.

# assumes: import numpy as np
#          from sklearn.metrics import jaccard_similarity_score

def validation_acc(self, sess, validation_frames, validation_labels):
    predictions = sess.run(self.prediction,
                           feed_dict={'x:0': validation_frames, 'y_:0': validation_labels})
    accuracy = jaccard_similarity_score(
        np.argmax(validation_labels, 1).reshape(validation_labels.shape[0]),
        np.argmax(predictions, 1).reshape(predictions.shape[0]))
    return accuracy
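As a side note, for single-label data the Jaccard similarity of the two argmax vectors reduces to the fraction of matching entries, i.e. plain accuracy, so a numpy mean should give the same number without the sklearn dependency. A small sketch with toy one-hot arrays (all names here are illustrative):

```python
import numpy as np

# toy one-hot labels and predictions over 5 classes
rng = np.random.default_rng(1)
labels = np.eye(5)[rng.integers(0, 5, size=100)]
preds = np.eye(5)[rng.integers(0, 5, size=100)]

# with one label per frame, comparing the argmax vectors and taking
# the mean is exactly frame-level classification accuracy
accuracy = np.mean(np.argmax(labels, 1) == np.argmax(preds, 1))
```

So the metric itself looks fine; if the numbers are low, the problem is more likely in the features or the model than in this function.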

Also, using CMVN I improved my accuracy from 2.7% to 3.4%. Am I doing something wrong with the f-banks calculation? As far as I can understand, the initial input expects an array, i.e. train frames of shape [?, 9, 40, 1]. Very sorry for asking for so much help, but I too am new to ASV and I really want to implement the CTDNN paper and get good results.
I think if you can share an idea of the kind of data you are using and the accuracy you are getting, we can try to improve it with different approaches. Another big issue is that I have to run any updated code of yours on CPU, because the GPU is always occupied by other people, and for that I have to make many changes depending on the variable self.gpu_ind. If possible, adding a boolean cpu_only flag to your config file, along with the dependent changes on that flag in the ctdnn.py file, would be a big upgrade to your Config class for people who want to run the code on CPU only. I really appreciate your work and help; thanks a lot once again.

gamesterrishi avatar gamesterrishi commented on June 12, 2024

Thanks @vzxxbacq for adding the method. Are you using an open-source dataset? And what is the duration of those utterances in your case? I took 3800 utterances with durations of 5 to 10 seconds as my new dataset. Could you point me to yours? Thanks.

vzxxbacq avatar vzxxbacq commented on June 12, 2024

In fact, a dataset of 3800 utterances is too small for a DNN model.

AISHELL is a great open-source dataset. They published it a few years ago and you can download it easily (link); I forget its exact size.

Recently they published a new dataset which is very large. But if you want to use it, you have to ask your teacher to write an application to them, and then they will send you a disk.

gamesterrishi avatar gamesterrishi commented on June 12, 2024

I had been using datasets from OpenSLR too. I have been trying the LibriSpeech train-clean300 dataset recently with your new code (this time on 2 GPUs), but I am getting an error:
Invalid argument: Shapes of all inputs must match: values[0].shape = [2048,10] != values[1].shape = [1228,10]
with a batch size of 4096. I tried running it on 2 GPUs with different batch sizes, but it always fails. I guess the error comes from around this code:

    batch_x, batch_y = validation.next_batch
    inp_dict = feed_all_gpu({}, models, val_payload_per_gpu, batch_x, batch_y)

I also suggest that you include the line batch_x = batch_x.reshape(-1, 9, 40, 1) to avoid an error on batch_x. I am not sure what alteration I should make to batch_y to solve the above error. Any pointers would be helpful, thanks.
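One hedged guess about the shape mismatch: the final validation batch (2048 + 1228 = 3276 examples here) is smaller than the full batch of 4096, so the two per-GPU splits disagree. A common workaround is to trim both sets to a multiple of the batch size; `trim_to_batch_multiple` below is a hypothetical helper, not part of the repo.

```python
import numpy as np

def trim_to_batch_multiple(x, y, batch_size):
    """Drop the trailing remainder so every batch, and every per-GPU
    split of it, has the same first dimension."""
    n = (x.shape[0] // batch_size) * batch_size
    return x[:n], y[:n]

# toy dataset shaped like the thread's inputs: [?, 9, 40, 1] and [?, 10]
x = np.zeros((10_000, 9, 40, 1))
y = np.zeros((10_000, 10))
x, y = trim_to_batch_multiple(x, y, 4096)
# 10_000 // 4096 = 2 full batches, so 8192 examples remain
```

Padding the last batch (and masking the padded rows in the loss) would keep all the data, at the cost of more bookkeeping.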

vzxxbacq avatar vzxxbacq commented on June 12, 2024

Hi @gamesterrishi, thanks for your reply. I have fixed it. I also found another bug: the validation set must be larger than one batch; sorry, I don't have any idea how to fix it yet. If you have any suggestion about this bug, or find other bugs, please contact me~

gamesterrishi avatar gamesterrishi commented on June 12, 2024

Hi @vzxxbacq, thanks for the quick fix. One possible cause could be that the first dimension of both the entire train and validation sets must be exactly divisible by the batch size. I tried testing this possibility, but I end up getting an infinite train loss and an eventual terminal hang-up at the end of the first epoch. I have been trying to find a solution to this problem since yesterday; I will let you know if I come up with something.

After using the updated code, I have 292 batches for a batch size of 5120 on 2 GPUs. At the end of the first epoch, I get an inf training loss and the following error:

batch_290, batch_loss=2.3023, payload_per_gpu=2560
batch_291, batch_loss=2.3022, payload_per_gpu=2560
Train loss:inf
/pyasv/model/ctdnn.py:430: RuntimeWarning: invalid value encountered in float_scalars
return np.dot(vector1, vector2) / (np.linalg.norm(vector1) * (np.linalg.norm(vector2)))
Traceback (most recent call last):
  File "speakai_ctdnn.py", line 51, in <module>
    main()
  File "speakai_ctdnn.py", line 48, in main
    run(config, train, validation)
  File "/pyasv/model/ctdnn.py", line 463, in run
    _multi_gpu(config, train, validation)
  File "/pyasv/model/ctdnn.py", line 419, in _multi_gpu
    correct_pred = np.equal(np.argmax(ys, 1), vec_preds)
ValueError: operands could not be broadcast together with shapes (158720,) (1495040,)

EDIT: I was getting the training loss as inf because I had changed the return statement of the _inference function from
return tf.nn.softmax(output), feature_layer,
to
return output, feature_layer
After changing it back to what you had written, I am getting a valid training loss. I don't understand why we are using softmax twice: once here, and then again in build_train_graph:
self._prediction = tf.nn.softmax(out)

I also solved the above ValueError: I was iterating in range over the feature variable instead of the feature_ variable. Sorry for the trouble, but the loss still seems to remain constant.
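On the RuntimeWarning from the cosine helper at ctdnn.py:430: dividing by a zero norm produces NaN, which then propagates into the scoring. A guarded version returns 0.0 for (near-)zero-norm vectors; this is a sketch of the idea under that convention, not the repo's implementation.

```python
import numpy as np

def cosine_similarity(v1, v2, eps=1e-12):
    """Cosine similarity that returns 0.0 instead of NaN when either
    vector has (near-)zero norm, e.g. an untrained embedding."""
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    if denom < eps:
        return 0.0
    return float(np.dot(v1, v2) / denom)

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
z = np.zeros(2)  # the case that triggers the warning in the original code
```

A zero-norm embedding early in training is plausible when the loss is stuck, so guarding here also makes the warning a useful signal rather than a crash risk.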

gamesterrishi avatar gamesterrishi commented on June 12, 2024

Hi, the loss and accuracy remain constant and are not decreasing even after several epochs.
My validation accuracy for 10 speakers on OpenSLR's American English corpus comes out at a constant 14.4%. Can you please let me know what accuracy you are achieving? Is there any other way we can improve it? Also, are we calculating the validation accuracy in the right way?

gamesterrishi avatar gamesterrishi commented on June 12, 2024

Hi @vzxxbacq, thanks for the quick fix. The loss is now certainly decreasing, but in each epoch it fails to go below a certain value. I think we might be missing some important step in the model. The accuracy I am getting now is 14% (better than before). I have been working on dimensionality-reduction techniques like LDA and t-SNE on the d-vector feature layer to reduce the dimensions from 400 to 150, as the paper says LDA gives better results. Also, I suspect that the p-norm fully connected layers must involve some operation related to the p-norm rather than just being fully connected layers. I am only speculating and I might be wrong, but I think we could explore this direction.

EDIT: I am getting around 57% accuracy on 100 speakers. I increased the amount of data and the number of speakers in a new version of my dataset, and it looks impressive. I am going to try once with a dataset of 921 speakers; let's hope we can pass the 70% mark. Thanks for your help 👍 I am splitting my dataset into train, validation and test sets. Do you think it might be possible for you to add a prediction function like in your previous versions, if you have some time?
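The LDA step mentioned above (reducing 400-dim d-vectors toward 150) could be sketched with scikit-learn as below. One caveat worth noting: LDA yields at most n_classes - 1 components, so reaching 150 dimensions needs at least 151 speakers. The toy data here is entirely synthetic.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# toy stand-ins for d-vectors: 400-dim embeddings for 20 speakers
rng = np.random.default_rng(0)
n_speakers, per_speaker = 20, 30
labels = np.repeat(np.arange(n_speakers), per_speaker)
dvectors = rng.normal(size=(n_speakers * per_speaker, 400)) + labels[:, None] * 0.5

# LDA can produce at most n_classes - 1 components, so with 20 toy
# speakers the most we can keep is 19 dimensions
lda = LinearDiscriminantAnalysis(n_components=n_speakers - 1)
reduced = lda.fit_transform(dvectors, labels)
# reduced.shape == (600, 19)
```

With the 921-speaker dataset mentioned above, reducing to 150 dimensions would be within LDA's limit.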

gamesterrishi avatar gamesterrishi commented on June 12, 2024

I will study the p-norm too and help implement it. As far as my understanding goes, each of the p-norm layers would be a maxout layer with a p-norm activation. I think TF already lets us use a maxout layer, but I am not sure how to build a p-norm maxout layer. Anyway, I will try to proceed with maxout for now and let you know the result.
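As a starting point, the p-norm layer from the generalized-maxout paper outputs, for each group of units, y_i = (sum over j in group i of |x_j|^p)^(1/p); ordinary maxout is the p = infinity case. The numpy sketch below shows the grouping idea only; the group size and p are my assumptions (the paper mostly uses p = 2), and a real TF layer would apply this after a linear projection.

```python
import numpy as np

def pnorm_layer(x, group_size=10, p=2):
    """P-norm nonlinearity: split the units of x ([batch, dim]) into
    groups of `group_size` and output the p-norm of each group."""
    n, d = x.shape
    assert d % group_size == 0, "dim must divide evenly into groups"
    groups = x.reshape(n, d // group_size, group_size)
    # vector p-norm over the last axis, one output per group
    return np.linalg.norm(groups, ord=p, axis=2)

x = np.random.randn(4, 400)
y = pnorm_layer(x)  # 400 units -> 40 p-norm outputs
```

The dimensionality reduction (400 to 40 here) comes for free from the grouping, which matches the layer's role in the CTDNN architecture.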

gamesterrishi avatar gamesterrishi commented on June 12, 2024

Hi @vzxxbacq, I have finished implementing the p-norm. I have also made some changes to the model layers per my understanding of the paper. I have been running the model for 100 speakers, and my validation accuracy is around 70% after 15 epochs so far; I am going to run it for longer. I also made some changes in DataManage4BigData to create a DataManage object for the 921-speaker data. I will create a pull request soon.

vzxxbacq avatar vzxxbacq commented on June 12, 2024

Hi, @gamesterrishi . Thanks for your contribution, it's very helpful~

gamesterrishi avatar gamesterrishi commented on June 12, 2024

I am very grateful to you. Please bear with a small delay before you receive my pull request.

gamesterrishi avatar gamesterrishi commented on June 12, 2024

Hi @vzxxbacq, please let me know your results if you get a chance to train with your data. I am getting 76% accuracy so far for 100 speakers of the LibriSpeech dataset, and it is still increasing slowly. If you think of a better interpretation of the paper, we can implement it that way. Also, if you get time, can you add the predict/restore methods? Meanwhile, I am thinking about a hyperparameter tuning strategy. Thanks, your help is appreciated!

vzxxbacq avatar vzxxbacq commented on June 12, 2024

Hi @gamesterrishi, I have committed the restore method, enjoy it~
Now I'm trying to add a PLDA method to pyasv.backend; PLDA will improve the results a lot. And maybe we can apply some tricks to the hyperparameters, like grid search.

vzxxbacq avatar vzxxbacq commented on June 12, 2024

Considering that the accuracy has been improved, can we close this issue?
