Comments (20)

syang1993 commented on June 2, 2024

@acetylSv

Thanks for your comment. To be honest, I'm not sure whether this implementation matches the details of the original paper, since the paper doesn't go into much detail about it.

About the tanh: applying it either before or after the attention process compresses the style embedding to the same scale as the encoder states. But since the paper suggests it, maybe it's better to apply it before the style attention. You can add the tanh operation at the line below to match the paper.

```python
tf.tile(tf.expand_dims(gst_tokens, axis=0), [batch_size, 1, 1]),  # [N, hp.num_gst, 256/hp.num_heads]
```
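
For example, a minimal sketch of that change, assuming the same variable names as the line above (this is my guess at matching the paper, not the paper's own code):

```python
tf.tile(tf.expand_dims(tf.nn.tanh(gst_tokens), axis=0), [batch_size, 1, 1]),  # tanh before the style attention
```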

I will also compare and change it. Thanks.

syang1993 commented on June 2, 2024

@fatchord Yes, the query of the multi-head attention comes from the reference encoder. The values are the style tokens, and a transform layer is applied to the tokens to get the keys, as in other attention methods.
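
A minimal sketch of that wiring, with hypothetical shapes and names (not the exact code in this repo):

```python
import tensorflow as tf

def style_token_attention(ref_embedding, gst_tokens, num_units=256, num_heads=4):
    """Single-query multi-head attention over the style tokens.

    ref_embedding: [N, ref_dim], the reference encoder output (the query).
    gst_tokens:    [num_gst, token_dim], the learned style token embeddings.
    """
    batch_size = tf.shape(ref_embedding)[0]
    query = tf.expand_dims(ref_embedding, axis=1)                   # [N, 1, ref_dim]
    values = tf.tile(tf.expand_dims(tf.nn.tanh(gst_tokens), 0),
                     [batch_size, 1, 1])                            # [N, num_gst, token_dim]

    # Linear transforms: the keys come from a projection of the tokens,
    # as in other attention methods; the values are (projected) tokens.
    q = tf.layers.dense(query, num_units, use_bias=False)           # [N, 1, num_units]
    k = tf.layers.dense(values, num_units, use_bias=False)          # [N, num_gst, num_units]
    v = tf.layers.dense(values, num_units, use_bias=False)          # [N, num_gst, num_units]

    # Split into heads, attend per head, then concatenate the heads back.
    q_ = tf.concat(tf.split(q, num_heads, axis=2), axis=0)          # [h*N, 1, num_units/h]
    k_ = tf.concat(tf.split(k, num_heads, axis=2), axis=0)
    v_ = tf.concat(tf.split(v, num_heads, axis=2), axis=0)
    scores = tf.matmul(q_, k_, transpose_b=True)                    # [h*N, 1, num_gst]
    scores /= (float(num_units) / num_heads) ** 0.5                 # scaled dot product
    weights = tf.nn.softmax(scores, axis=-1)
    heads = tf.matmul(weights, v_)                                  # [h*N, 1, num_units/h]
    return tf.concat(tf.split(heads, num_heads, axis=0), axis=2)    # [N, 1, num_units]
```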

fatchord commented on June 2, 2024

@syang1993 Great work! I'm having a go at this myself in PyTorch. Just to clarify one thing: the query to the style attention is the hidden state from the reference encoder, and both the keys and values are the style tokens, right?

fatchord commented on June 2, 2024

@syang1993 Thanks! It makes perfect sense now.

fatchord commented on June 2, 2024

@syang1993 Sorry to bother you again. I'm just curious if you'd like to swap notes.

I've implemented the GST model in PyTorch, and conditioning on reference audio works quite well. The only problem is using the style tokens themselves without reference audio: I'm not getting good results. I'm wondering, have you tried using just the style tokens? Any luck with it?

I've tried content attention too, and again, reference audio works great but the style tokens aren't working by themselves.

syang1993 commented on June 2, 2024

@fatchord Sure, we can communicate and work together! In my earlier experiments, I tested the GST model without reference audio. To do this, I used some random weights for the style tokens; sometimes it generated good audio, but sometimes it didn't.
I haven't tested the new code since I'm not at school this month, but I guess it may suffer from the same problem.
I'm also confused by this problem; I don't know whether it's due to the data size or an implementation error. What do you think? Besides, it would be very helpful if you could share your PyTorch repo. :) I also started using PyTorch last month, so I think I can learn a lot from you.
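
For concreteness, this is roughly what I mean by using random weights at inference; the names here are hypothetical (`gst_tokens` stands for the learned token table):

```python
import numpy as np
import tensorflow as tf

num_gst, token_dim = 10, 256
gst_tokens = tf.get_variable('style_tokens', [num_gst, token_dim])   # learned table

# Skip the reference encoder and choose the attention weights by hand:
manual = np.random.dirichlet(np.ones(num_gst)).astype(np.float32)    # random mix
# manual = np.eye(num_gst, dtype=np.float32)[3]                      # or a single token
weights = tf.constant(manual[None, :])                               # [1, num_gst]
style_embedding = tf.matmul(weights, tf.nn.tanh(gst_tokens))         # [1, token_dim]
# style_embedding is then broadcast onto the text encoder outputs as in training.
```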

fatchord commented on June 2, 2024

@syang1993 My initial thought on the problem was that perhaps it was the multi-head attention, since it increases the effective number of tokens by a factor of num_heads (or is my intuition wrong here?). So when I picked a single style token, it would have multiple 'heads' in it; in other words, the attention mechanism would likely never pick all the heads of a single style token at any one time. And same as you, I only had luck with random tokens.
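
Rough numbers to illustrate what I mean (hypothetical sizes, simplified to one weight vector per head):

```python
import numpy as np

# With multi-head attention the model mixes num_heads independent
# distributions over the tokens, so "one token" is really num_heads slices
# that the trained model may never select together.
num_gst, num_heads, head_dim = 10, 4, 256 // 4
tokens = np.random.randn(num_gst, head_dim)          # stand-in token table
head_weights = np.zeros((num_heads, num_gst))
head_weights[:, 3] = 1.0                             # force every head onto token 3
style = np.concatenate([w @ tokens for w in head_weights])
print(style.shape)                                   # (256,)
```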

So then I tried content attention, and I realised just a couple of hours ago that I made a silly mistake in the attention, after training for 700k steps! So I have to retrain it. Anyway, I'm thinking the advantage of content attention is that it allows for straightforward selection of a single style token.

I'm in two minds about creating a repo for it. I really need to polish the code, and there could be more silly mistakes in there, so I'll have to recheck every single line again.

syang1993 commented on June 2, 2024

@fatchord Thanks for your thoughts. I'm also not so sure about the multi-head attention, since the paper doesn't give the details. I will do more experiments to verify this after I go back to school.
If you get further results or conclusions, could you share them with me?

Yeah, sometimes a small mistake has a big influence; my earlier repo also had mistakes, so its performance wasn't as good as the current one's. I'm looking forward to your new repo when you finish it.

fatchord commented on June 2, 2024

So I trained the content-based attention for the style tokens and, again, I get the same problem. Not sure how to move forward on this.

fatchord commented on June 2, 2024

@syang1993 The only thing I can think of: perhaps the softmax attention should be sharpened? That would force the model to rely mainly on one style token at any time, making it more 'natural' to condition the decoder on a single style token at inference. What do you think?
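
To be concrete, something like this is what I have in mind; the temperature is a hypothetical knob, not something from the paper:

```python
import tensorflow as tf

def sharpened_softmax(scores, temperature=0.5):
    """Sharpen the style attention; scores are [N, 1, num_gst] logits.

    Dividing by a temperature < 1 pushes the distribution toward one-hot,
    so the model leans on (mostly) one token at a time during training.
    """
    return tf.nn.softmax(scores / temperature, axis=-1)
```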

syang1993 commented on June 2, 2024

@fatchord Yeah, I will check the weights of each token to see what happens. If one token always gets a large weight, maybe that's the problem you mentioned.

But in the paper, they said they found each token had a specific meaning, such as speaking speed. I didn't observe this either, and I'm not sure whether it's because of the limited data. I'll get more data soon, then I can check it.
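
The check itself would be something like this (assuming I dump the style-attention softmax outputs over a validation set to a hypothetical `.npy` file):

```python
import numpy as np

weights = np.load('style_attention_weights.npy')   # hypothetical dump, [num_examples, num_gst]
mean_w = weights.mean(axis=0)                      # average weight per token
print('mean weight per token:', mean_w)
print('dominant token:', mean_w.argmax())          # one token always winning would
                                                   # match the problem you mentioned
```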

fazlekarim commented on June 2, 2024

@fatchord Is your PyTorch code available online?

fatchord commented on June 2, 2024

@fazlekarim Not at the minute. It'll probably be online sometime next week or so.

fatchord commented on June 2, 2024

@syang1993 Not sure if you're aware of it, but there's a new paper from the Tacotron crew: https://arxiv.org/pdf/1808.01410.pdf

fazlekarim commented on June 2, 2024

@fatchord That's so sad! When you're about to come up with something new, they just come up with something even better.

fatchord commented on June 2, 2024

@fazlekarim That's the way it goes, I guess! On the upside, the additional ideas introduced in the paper should be fairly straightforward to implement.

syang1993 commented on June 2, 2024

@fatchord Thanks for the reminder! I took a look at it yesterday; as you said, it's easy to implement since they only added an extra module to predict the style embedding from the text.
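
My rough reading of that module, sketched below; the mean-pooling and layer sizes are my assumptions, not the paper's exact architecture:

```python
import tensorflow as tf

def text_predicted_style(text_encoder_outputs, style_embedding, units=256):
    """Predict the style embedding from text, roughly in the TP-GST spirit.

    text_encoder_outputs: [N, T, enc_dim]; style_embedding: [N, 1, units],
    the GST attention output. The stop_gradient keeps the prediction loss
    from disturbing the tokens themselves.
    """
    summary = tf.reduce_mean(text_encoder_outputs, axis=1)       # [N, enc_dim]
    predicted = tf.layers.dense(summary, units, tf.nn.tanh)      # [N, units]
    target = tf.stop_gradient(tf.squeeze(style_embedding, 1))    # [N, units]
    loss = tf.reduce_mean(tf.squared_difference(predicted, target))
    return predicted, loss  # at inference, use `predicted` instead of the GST output
```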

fatchord commented on June 2, 2024

@syang1993 One other thing: I had a look at the Blizzard 2013 dataset, and it looks like they stripped out all the quotation marks. I think this could be a problem, because the woman narrating changes her voice style dramatically when the text is in quotes. Without them, I think the model will find it more difficult to model the prosody.

abuvaneswari commented on June 2, 2024

@syang1993 Is there an update to this repo with the TP-GST feature?

niu0717 commented on June 2, 2024

@fatchord It would be very helpful if you could share your PyTorch repo. :)
