Hi First of all thanks for your implementation! For the selected tokens you mu


Normalize topk_weight about mixture-of-depths HOT 3 CLOSED

ostix360 commented on August 15, 2024

Normalize topk_weight

from mixture-of-depths.

Comments (3)

sramshetty commented on August 15, 2024

Hi, thanks for bringing this up!
Definitely an interesting point. The paper does mention that the auxiliary loss "centers the sigmoid of the router’s outputs around 0.5", which may also hint that they apply sigmoid to the router weights themselves, not just inside the BCE loss.

I wasn't sure whether to do so, but since sigmoid wouldn't break causality maybe this is something we could include? I can try including it, but let me know if you think that makes sense.
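The causality point can be illustrated with a minimal sketch. It assumes the alternative would be a softmax taken over the sequence dimension; the logits below are made-up values, not outputs of this repository. Sigmoid weights each token from its own logit only, while a softmax over the sequence changes an earlier token's weight when a later token changes.

```python
import math


def sigmoid(x):
    """Element-wise logistic function for a single router logit."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    """Softmax over a whole sequence of router logits (numerically stable)."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical router logits for a 4-token sequence.
logits = [0.5, -1.0, 2.0, 0.1]
# Same prefix, but the logit of the last (future) token is different.
logits_changed = [0.5, -1.0, 2.0, 3.0]

sig_a = [sigmoid(x) for x in logits]
sig_b = [sigmoid(x) for x in logits_changed]
soft_a = softmax(logits)
soft_b = softmax(logits_changed)

# Sigmoid: the weight of token 0 depends only on its own logit.
sigmoid_weight_unchanged = sig_a[0] == sig_b[0]
# Softmax over the sequence: the weight of token 0 shifts with a later token.
softmax_weight_unchanged = math.isclose(soft_a[0], soft_b[0])

print(sigmoid_weight_unchanged, softmax_weight_unchanged)
```

Note that this only concerns the weighting function; the top-k selection itself is a separate question.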

As for the auxiliary router, the paper doesn't describe it in much detail, but one thing I might try is training it after the rest of the model is trained. I originally tried to do this with an external model but found it cumbersome to store the main router predictions to train the auxiliary one. However, it doesn't make sense to me to train the auxiliary router before the main router has been learned, so training it afterwards might improve the results.

I'll try to do both and see if it changes/improves the results. Thanks again!


ostix360 commented on August 15, 2024

I tried softmax and yes, it is bad: the model cannot produce correct sentences even though the loss decreases.

I trained a model with sigmoid instead, and the loss seems to decrease to a much lower value than without sigmoid.

With sigmoid, the generated sentences are also correct.

I think this normalization is important to keep the hidden_state of the routed tokens at roughly the same scale and to avoid very large values that cause the RMS norm to produce NaN.
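A minimal sketch of the scale argument, assuming the router weight multiplies the block output of each routed token before it is added back to the residual stream. The vector and the logit are made-up values. It only shows that sigmoid bounds the scaling factor to (0, 1); it does not reproduce the NaN itself.

```python
import math


def sigmoid(x):
    """Logistic function; maps any router logit into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def l2_norm(vec):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(v * v for v in vec))


def scale(vec, weight):
    """Multiply every component of vec by a scalar router weight."""
    return [weight * v for v in vec]


# Hypothetical block output for one routed token.
block_out = [0.3, -1.2, 0.8, 0.5]
# Hypothetical large router logit, as could appear if logits drift in training.
router_logit = 40.0

raw_scaled = scale(block_out, router_logit)           # unnormalized weight
sig_scaled = scale(block_out, sigmoid(router_logit))  # sigmoid-normalized weight

base_norm = l2_norm(block_out)
raw_norm = l2_norm(raw_scaled)
sig_norm = l2_norm(sig_scaled)

# The raw weight inflates the norm by the logit; sigmoid never exceeds the base norm.
print(base_norm, raw_norm, sig_norm)
```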

"I originally tried to do this with an external model but found it cumbersome to store the main router predictions to train the auxiliary one."

What do you mean?


sramshetty commented on August 15, 2024

Hi @ostix360,

I added sigmoid in #4 and also updated the training script for the auxiliary router. Now I just train it after the MoD model, and this seems to produce the expected outputs, which is what I meant earlier. Thanks for the suggestion, and let me know if you think it makes sense. Thanks!
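For readers following along, training the auxiliary router after the main model could look roughly like the sketch below. This is not the code from #4. It assumes the auxiliary router is fit with binary cross-entropy against top-k membership under the already-trained main router; the function names and logits are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function applied to a single logit."""
    return 1.0 / (1.0 + math.exp(-x))


def topk_targets(router_logits, k):
    """Return 1.0 for tokens among the k largest main-router logits, else 0.0."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    chosen = set(order[:k])
    return [1.0 if i in chosen else 0.0 for i in range(len(router_logits))]


def bce_with_logits(aux_logits, targets, eps=1e-12):
    """Mean binary cross-entropy between sigmoid(aux_logits) and 0/1 targets."""
    total = 0.0
    for z, t in zip(aux_logits, targets):
        p = sigmoid(z)
        total -= t * math.log(p + eps) + (1.0 - t) * math.log(1.0 - p + eps)
    return total / len(targets)


# Hypothetical logits from the frozen, already-trained main router.
main_router_logits = [0.5, -1.0, 2.0, 0.1]
targets = topk_targets(main_router_logits, k=2)

# Two hypothetical auxiliary-router outputs: one agreeing with the main
# router's selection, one predicting the opposite.
agreeing_aux = [3.0, -3.0, 3.0, -3.0]
opposite_aux = [-3.0, 3.0, -3.0, 3.0]

loss_agreeing = bce_with_logits(agreeing_aux, targets)
loss_opposite = bce_with_logits(opposite_aux, targets)
print(targets, loss_agreeing, loss_opposite)
```

Because the targets come from a fixed main router, they do not shift while the auxiliary router is being fit, which matches the motivation given earlier in the thread.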

