Comments (3)

sramshetty commented on August 15, 2024

Hi, thanks for bringing this up!
Definitely an interesting point. The paper does mention that the auxiliary loss "centers the sigmoid of the router’s outputs around 0.5", which may hint that they apply sigmoid to the router weights themselves, not just inside the BCE loss.

I wasn't sure whether to do so, but since sigmoid wouldn't break causality maybe this is something we could include? I can try including it, but let me know if you think that makes sense.
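To make the causality point concrete, here is a minimal sketch in plain Python (not the repo's code; the logit values are made up). It assumes the softmax in question is taken across the sequence positions, and compares the weight given to the first token when only a later token's router logit changes.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    # Subtract the max for numerical stability.
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical router logits, one per token position.
logits = [0.5, -1.0, 2.0, 0.1]
# Same sequence, but only the LAST token's logit differs.
logits_changed = [0.5, -1.0, 2.0, 3.0]

sig_a = [sigmoid(z) for z in logits]
sig_b = [sigmoid(z) for z in logits_changed]
soft_a = softmax(logits)
soft_b = softmax(logits_changed)

# Sigmoid is applied per token, so token 0's weight ignores later tokens.
sigmoid_is_causal = sig_a[0] == sig_b[0]
# Softmax over the sequence normalizes by all positions, so token 0's
# weight shifts when a later logit changes.
softmax_is_causal = math.isclose(soft_a[0], soft_b[0])

print(sigmoid_is_causal, softmax_is_causal)  # True False
```

Under that assumption, the sigmoid weight for a token depends only on that token's own logit, which is why it should be safe at sampling time.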

As for the auxiliary router, the paper doesn't describe it in much detail but one thing that I might try is to train it after the rest of the model is trained. I originally tried to do this with an external model but found it cumbersome to store the main router predictions to train the auxiliary one. However, it doesn't make sense to me to train the auxiliary router until after the main router is learned, so maybe doing so would improve the results.
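As a rough sketch of what "train the auxiliary router after the main router" could look like (plain Python, all names and numbers hypothetical, not the repo's training script): the frozen main router's top-k selection gives binary targets, and the auxiliary router's per-token logits are scored against them with BCE.

```python
import math

def topk_targets(router_logits, k):
    """Return 1.0 for tokens among the k largest router logits, else 0.0."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    chosen = set(order[:k])
    return [1.0 if i in chosen else 0.0 for i in range(len(router_logits))]

def bce_with_logits(pred_logits, targets):
    """Mean binary cross-entropy computed from raw logits."""
    total = 0.0
    for z, t in zip(pred_logits, targets):
        # Numerically stable form: max(z, 0) - z*t + log(1 + exp(-|z|))
        total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total / len(targets)

# Hypothetical outputs of the already-trained (frozen) main router.
main_router_logits = [0.5, -1.0, 2.0, 0.1]
targets = topk_targets(main_router_logits, k=2)  # [1.0, 0.0, 1.0, 0.0]

# Hypothetical per-token predictions from the auxiliary router.
aux_logits = [1.2, -0.7, 0.3, -2.0]
loss = bce_with_logits(aux_logits, targets)
print(targets, loss)
```

The targets only become meaningful once the main router's top-k choices have settled, which is the motivation for training the auxiliary router second.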

I'll try to do both and see if it changes/improves the results. Thanks again!

from mixture-of-depths.

ostix360 commented on August 15, 2024

I tried softmax and yes, it is bad: the model cannot produce correct sentences even though the loss decreases.

I then trained the model with sigmoid instead, and the loss seems to decrease to a much lower value than without sigmoid.

With sigmoid, the sentences are correct.

I think this normalization is important to keep the hidden_state of the routed tokens at roughly the same scale and to avoid very large values that cause the RMS norm to produce NaN values.
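A small sketch of that intuition, in plain Python with made-up numbers. It assumes the routed token is updated as x + weight * block_out, and only illustrates the scale argument; it does not reproduce the NaN itself.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def routed_update(x, block_out, weight):
    """Residual update for one routed token: x + weight * block_out."""
    return [xi + weight * bi for xi, bi in zip(x, block_out)]

def l2_norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

x = [1.0, -2.0, 0.5]          # hypothetical hidden state of a routed token
block_out = [0.3, 0.1, -0.4]  # hypothetical block output for that token
raw_logit = 40.0              # a router logit that has drifted large

# Raw logit as the weight: the update scales with the logit.
raw = routed_update(x, block_out, raw_logit)
# Sigmoid weight lies in (0, 1), so the update is at most block_out itself.
gated = routed_update(x, block_out, sigmoid(raw_logit))

print(l2_norm(x), l2_norm(raw), l2_norm(gated))
```

With the sigmoid weight, the updated norm can never exceed the norm of x plus the norm of block_out, whereas the raw-logit version grows with the logit.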

"I originally tried to do this with an external model but found it cumbersome to store the main router predictions to train the auxiliary one."

What do you mean?


sramshetty commented on August 15, 2024

Hi @ostix360,

I added sigmoid in #4 and also updated the training script for the auxiliary router. I now train it after the MoD model, which is what I meant earlier, and this seems to produce the expected outputs. Thanks for the suggestion, and let me know if you think it makes sense. Thanks!

