Comments (20)

syang1993 commented on June 2, 2024

@acetylSv

Thanks for your comment. To be honest, I'm not sure whether this implementation matches the details of the original paper, since the paper doesn't go into much detail about it.

About the tanh: applying it either before or after the attention process compresses the style embedding to the same scale as the encoder states. But since the paper suggests it, maybe it's better to apply it before the style attention. You can add the tanh operation at the line below to match the paper.

```python
tf.tile(tf.expand_dims(gst_tokens, axis=0), [batch_size, 1, 1]),  # [N, hp.num_gst, 256/hp.num_heads]
```
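
For example, a minimal sketch of that change, assuming the same variable names as the line above (this is my guess at matching the paper, not the paper's own code):

```python
tf.tile(tf.expand_dims(tf.nn.tanh(gst_tokens), axis=0), [batch_size, 1, 1]),  # tanh before the style attention
```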

I will also compare and change it. Thanks.

syang1993 commented on June 2, 2024

@fatchord Yes, the query of the multi-head attention comes from the reference encoder. The values are the style tokens, and a transform layer is applied to the tokens to get the keys, as in other attention methods.
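
A minimal sketch of that wiring, with hypothetical shapes and names (not the exact code in this repo):

```python
import tensorflow as tf

def style_token_attention(ref_embedding, gst_tokens, num_units=256, num_heads=4):
    """Single-query multi-head attention over the style tokens.

    ref_embedding: [N, ref_dim], the reference encoder output (the query).
    gst_tokens:    [num_gst, token_dim], the learned style token embeddings.
    """
    batch_size = tf.shape(ref_embedding)[0]
    query = tf.expand_dims(ref_embedding, axis=1)                   # [N, 1, ref_dim]
    values = tf.tile(tf.expand_dims(tf.nn.tanh(gst_tokens), 0),
                     [batch_size, 1, 1])                            # [N, num_gst, token_dim]

    # Linear transforms: the keys come from a projection of the tokens,
    # as in other attention methods; the values are (projected) tokens.
    q = tf.layers.dense(query, num_units, use_bias=False)           # [N, 1, num_units]
    k = tf.layers.dense(values, num_units, use_bias=False)          # [N, num_gst, num_units]
    v = tf.layers.dense(values, num_units, use_bias=False)          # [N, num_gst, num_units]

    # Split into heads, attend per head, then concatenate the heads back.
    q_ = tf.concat(tf.split(q, num_heads, axis=2), axis=0)          # [h*N, 1, num_units/h]
    k_ = tf.concat(tf.split(k, num_heads, axis=2), axis=0)
    v_ = tf.concat(tf.split(v, num_heads, axis=2), axis=0)
    scores = tf.matmul(q_, k_, transpose_b=True)                    # [h*N, 1, num_gst]
    scores /= (float(num_units) / num_heads) ** 0.5                 # scaled dot product
    weights = tf.nn.softmax(scores, axis=-1)
    heads = tf.matmul(weights, v_)                                  # [h*N, 1, num_units/h]
    return tf.concat(tf.split(heads, num_heads, axis=0), axis=2)    # [N, 1, num_units]
```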

fatchord commented on June 2, 2024

@syang1993 Great work! I'm having a go at this myself in PyTorch. Just to clarify one thing: the query to the style attention is the hidden state from the reference encoder, and both the keys and values are the style tokens, right?

fatchord commented on June 2, 2024

@syang1993 Thanks! It makes perfect sense now.

fatchord commented on June 2, 2024

@syang1993 Sorry to bother you again. I'm just curious if you'd like to swap notes.

I've implemented the GST model in PyTorch, and conditioning on reference audio works quite well. The only problem is using the style tokens themselves without reference audio: I'm not getting good results. I'm wondering, have you tried using just the style tokens? Any luck with it?

I've tried content attention too, and again, reference audio works great but the style tokens aren't working by themselves.

syang1993 commented on June 2, 2024

@fatchord Sure, we can communicate and work together! In my earlier experiments, I tested the GST model without reference audio. To do this, I used some random weights for the style tokens; sometimes it generated good audio, but sometimes it didn't.
I haven't tested the new code since I'm not at school this month, but I guess it may suffer from the same problem.
I'm also confused by this problem; I don't know whether it's due to the data size or an implementation error. What do you think? Besides, it would be very helpful if you could share your PyTorch repo. :) I also started using PyTorch last month, so I think I can learn a lot from you.
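
For concreteness, this is roughly what I mean by using random weights at inference; the names here are hypothetical (`gst_tokens` stands for the learned token table):

```python
import numpy as np
import tensorflow as tf

num_gst, token_dim = 10, 256
gst_tokens = tf.get_variable('style_tokens', [num_gst, token_dim])   # learned table

# Skip the reference encoder and choose the attention weights by hand:
manual = np.random.dirichlet(np.ones(num_gst)).astype(np.float32)    # random mix
# manual = np.eye(num_gst, dtype=np.float32)[3]                      # or a single token
weights = tf.constant(manual[None, :])                               # [1, num_gst]
style_embedding = tf.matmul(weights, tf.nn.tanh(gst_tokens))         # [1, token_dim]
# style_embedding is then broadcast onto the text encoder outputs as in training.
```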

fatchord commented on June 2, 2024

@syang1993 My initial thought on the problem was that perhaps it was the multi-head attention, since it increases the effective number of tokens by a factor of num_heads (or is my intuition wrong here?). So when I picked a single style token, it would have multiple 'heads' in it; in other words, the attention mechanism would likely never pick all the heads of a single style token at any one time. And same as you, I only had luck with random tokens.
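
Rough numbers to illustrate what I mean (hypothetical sizes, simplified to one weight vector per head):

```python
import numpy as np

# With multi-head attention the model mixes num_heads independent
# distributions over the tokens, so "one token" is really num_heads slices
# that the trained model may never select together.
num_gst, num_heads, head_dim = 10, 4, 256 // 4
tokens = np.random.randn(num_gst, head_dim)          # stand-in token table
head_weights = np.zeros((num_heads, num_gst))
head_weights[:, 3] = 1.0                             # force every head onto token 3
style = np.concatenate([w @ tokens for w in head_weights])
print(style.shape)                                   # (256,)
```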

So then I tried content attention, and I realised just a couple of hours ago that I made a silly mistake in the attention, after training for 700k steps! So I have to retrain it. Anyway, I'm thinking the advantage of content attention is that it allows for straightforward selection of a single style token.

I'm in two minds about creating a repo for it. I really need to polish the code, and there could be more silly mistakes in there, so I'll have to recheck every single line again.

syang1993 commented on June 2, 2024

@fatchord Thanks for your thoughts. I'm also not so sure about the multi-head attention, since the paper doesn't give the details. I will do more experiments to verify this after I go back to school.
If you get further results or conclusions, could you share them with me?

Yeah, sometimes a small mistake has a big influence; my earlier repo also had mistakes, so its performance wasn't as good as the current one's. I'm looking forward to your new repo when you finish it.

fatchord commented on June 2, 2024

So I trained the content-based attention for the style tokens and, again, I get the same problem. Not sure how to move forward on this.

fatchord commented on June 2, 2024

@syang1993 The only thing I can think of: perhaps the softmax attention should be sharpened? That would force the model to rely mainly on one style token at any time, making it more 'natural' to condition the decoder on a single style token at inference. What do you think?
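
To be concrete, something like this is what I have in mind; the temperature is a hypothetical knob, not something from the paper:

```python
import tensorflow as tf

def sharpened_softmax(scores, temperature=0.5):
    """Sharpen the style attention; scores are [N, 1, num_gst] logits.

    Dividing by a temperature < 1 pushes the distribution toward one-hot,
    so the model leans on (mostly) one token at a time during training.
    """
    return tf.nn.softmax(scores / temperature, axis=-1)
```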

syang1993 commented on June 2, 2024

@fatchord Yeah, I will check the weights of each token to see what happens. If one token always gets a large weight, maybe that's the problem you mentioned.

But in the paper, they said they found each token had a specific meaning, such as speaking speed. I didn't observe this either, and I'm not sure whether it's because of the limited data. I'll get more data soon, then I can check it.
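
The check itself would be something like this (assuming I dump the style-attention softmax outputs over a validation set to a hypothetical `.npy` file):

```python
import numpy as np

weights = np.load('style_attention_weights.npy')   # hypothetical dump, [num_examples, num_gst]
mean_w = weights.mean(axis=0)                      # average weight per token
print('mean weight per token:', mean_w)
print('dominant token:', mean_w.argmax())          # one token always winning would
                                                   # match the problem you mentioned
```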

fazlekarim commented on June 2, 2024

@fatchord Is your PyTorch code available online?

fatchord commented on June 2, 2024

@fazlekarim Not at the minute. It'll probably be online sometime next week or so.

fatchord commented on June 2, 2024

@syang1993 Not sure if you're aware of it, but there's a new paper from the Tacotron crew: https://arxiv.org/pdf/1808.01410.pdf

fazlekarim commented on June 2, 2024

@fatchord That's so sad! When you're about to come up with something new, they just come up with something even better.

fatchord commented on June 2, 2024

@fazlekarim That's the way it goes, I guess! On the upside, the additional ideas introduced in the paper should be fairly straightforward to implement.

syang1993 commented on June 2, 2024

@fatchord Thanks for the reminder! I took a look at it yesterday; as you said, it's easy to implement since they only added an extra module to predict the style embedding from the text.
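
My rough reading of that module, sketched below; the mean-pooling and layer sizes are my assumptions, not the paper's exact architecture:

```python
import tensorflow as tf

def text_predicted_style(text_encoder_outputs, style_embedding, units=256):
    """Predict the style embedding from text, roughly in the TP-GST spirit.

    text_encoder_outputs: [N, T, enc_dim]; style_embedding: [N, 1, units],
    the GST attention output. The stop_gradient keeps the prediction loss
    from disturbing the tokens themselves.
    """
    summary = tf.reduce_mean(text_encoder_outputs, axis=1)       # [N, enc_dim]
    predicted = tf.layers.dense(summary, units, tf.nn.tanh)      # [N, units]
    target = tf.stop_gradient(tf.squeeze(style_embedding, 1))    # [N, units]
    loss = tf.reduce_mean(tf.squared_difference(predicted, target))
    return predicted, loss  # at inference, use `predicted` instead of the GST output
```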

fatchord commented on June 2, 2024

@syang1993 One other thing: I had a look at the Blizzard 2013 dataset, and it looks like they stripped out all the quotation marks. I think this could be a problem, because the woman narrating changes her voice style dramatically when the text is in quotes. Without them, I think the model will find it more difficult to model the prosody.

abuvaneswari commented on June 2, 2024

@syang1993 Is there an update to this repo with the TP-GST feature?

niu0717 commented on June 2, 2024

@fatchord It would be very helpful if you could share your PyTorch repo. :)
