Comments (3)
Hi, thanks for bringing this up!
Definitely an interesting point. The paper does mention that the auxiliary loss "centers the sigmoid of the router’s outputs around 0.5", which may also hint that they apply sigmoid to the router weights, not just for BCE.
I wasn't sure whether to do so, but since sigmoid wouldn't break causality, maybe this is something we could include? I can try adding it, but let me know if you think that makes sense.
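To make the causality point concrete, here is a minimal sketch (not the repo's code). It assumes the alternative would be a softmax taken over the sequence dimension; the logits are made-up values. Sigmoid is applied per token, so changing a later token's logit leaves the earlier tokens' weights untouched, whereas a softmax over the sequence couples every position.

```python
import numpy as np

def sigmoid(x):
    # Element-wise: each token's weight depends only on its own logit.
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    # Normalizes over the whole sequence, so every weight depends on all logits.
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical router logits for a 4-token sequence.
logits = np.array([0.5, -1.0, 2.0, 0.3])
changed = logits.copy()
changed[-1] = 5.0  # perturb only the last (future) token

# Compare the weights of the earlier tokens before and after the change.
sig_same = np.allclose(sigmoid(logits)[:-1], sigmoid(changed)[:-1])
soft_same = np.allclose(softmax(logits)[:-1], softmax(changed)[:-1])
print("sigmoid weights of earlier tokens unchanged:", sig_same)
print("softmax weights of earlier tokens unchanged:", soft_same)
```

This only illustrates the weighting step; it says nothing about how the tokens are selected in the first place.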
As for the auxiliary router, the paper doesn't describe it in much detail, but one thing I might try is to train it after the rest of the model is trained. I originally tried to do this with an external model but found it cumbersome to store the main router predictions to train the auxiliary one. However, it doesn't make sense to me to train the auxiliary router before the main router has been learned, so maybe doing so would improve the results.
I'll try to do both and see if it changes/improves the results. Thanks again!
from mixture-of-depths.
I tried softmax and yes, it is bad: the model cannot produce correct sentences even though the loss decreases.
I then trained the model with sigmoid instead, and the loss seems to decrease to a much lower value than without sigmoid.
With sigmoid, the sentences are correct.
I think this normalization is important to keep the hidden_state of the routed tokens at roughly the same scale and to avoid overly large values that cause the RMS norm to produce NaN values.
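A rough sketch of the scale argument, under assumptions: the routed update is taken to be `hidden + weight * block_out`, and all tensors and the deliberately large raw scores are random stand-ins, not values from the repo. With sigmoid the weight lies in [0, 1], so each updated row is bounded by the sum of the two input norms; with raw scores there is no such bound.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical stand-ins for 8 routed tokens with hidden size 16.
hidden = rng.standard_normal((8, 16))      # residual stream
block_out = rng.standard_normal((8, 16))   # output of the routed block
logits = rng.standard_normal(8) * 20.0     # deliberately large raw router scores

# Assumed update rule: scale the block output by the router weight, then add.
raw_update = hidden + logits[:, None] * block_out
gated_update = hidden + sigmoid(logits)[:, None] * block_out

raw_norm = np.linalg.norm(raw_update, axis=-1).max()
gated_norm = np.linalg.norm(gated_update, axis=-1).max()

# Triangle inequality: with weights in [0, 1], each row satisfies
# ||h + w * b|| <= ||h|| + ||b||.
bound = (np.linalg.norm(hidden, axis=-1) + np.linalg.norm(block_out, axis=-1)).max()

print("max row norm with raw scores:", raw_norm)
print("max row norm with sigmoid   :", gated_norm)
print("upper bound for sigmoid case:", bound)
```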
"I originally tried to do this with an external model but found it cumbersome to store the main router predictions to train the auxiliary one."
What do you mean?
Hi @ostix360,
I added sigmoid in #4 and also updated the training script for the auxiliary router. Now I just train it after the MoD model, and this seems to produce the expected outputs, which is what I meant earlier. Thanks for the suggestion, and let me know if you think it makes sense. Thanks!
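For anyone following along, the "train it afterwards" idea can be sketched roughly like this. It is an illustration under assumptions, not the training script from #4: the frozen main router is a random linear layer, the auxiliary router is a plain per-token logistic predictor, and names such as `main_router_w`, `aux_w`, and `capacity` are hypothetical. The targets are the tokens the main router's top-k would select, and the auxiliary router is fit to them with BCE.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, dim, capacity = 32, 8, 8

# Stand-ins for a frozen, already-trained MoD layer (hypothetical values).
tokens = rng.standard_normal((seq_len, dim))
main_router_w = rng.standard_normal(dim)
main_scores = tokens @ main_router_w

# Targets: 1 for the tokens the main router's top-k would select, else 0.
targets = np.zeros(seq_len)
targets[np.argsort(main_scores)[-capacity:]] = 1.0

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce(p, y, eps=1e-9):
    # Binary cross-entropy averaged over tokens.
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Auxiliary router: a per-token linear predictor trained with BCE only.
# The main router's weights are never updated here.
aux_w, aux_b, lr = np.zeros(dim), 0.0, 0.5
initial_loss = bce(sigmoid(tokens @ aux_w + aux_b), targets)
for _ in range(200):
    p = sigmoid(tokens @ aux_w + aux_b)
    grad = p - targets                       # d(BCE)/d(logit) per token
    aux_w -= lr * tokens.T @ grad / seq_len
    aux_b -= lr * grad.mean()
final_loss = bce(sigmoid(tokens @ aux_w + aux_b), targets)

print("BCE before training:", initial_loss)
print("BCE after training :", final_loss)
```

Because the targets are computed from the frozen router on the fly, nothing needs to be stored between the two training phases.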