In the paper, the authors mentioned that the initialization followed DeepNet but from

initialization of qkv (torchscale issue, closed)

XintianHan commented on May 20, 2024

initialization of qkv


Comments (3)

XintianHan commented on May 20, 2024

"RetNet uses DeepNet's derivation methods to obtain the initialization for better training stability, instead of directly re-using its derived initialization (on Post-LN transformers), because the initialization depends on the model architecture according to the theory in DeepNet."

Thanks for the quick reply!

"because the initialization depends on the model architecture according to the theory in DeepNet"

Could you elaborate on the derivation method? How do you arrive at the number 2 ** -2.5 here? Thanks.
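
To make the question concrete, here is a minimal sketch of what a factor of 2 ** -2.5 would mean numerically, assuming it is used as the `gain` of a Xavier (Glorot) uniform initializer on a square projection matrix. That usage is an assumption, since the thread does not quote the code, and `embed_dim = 1024` is a hypothetical size chosen only for illustration. The sketch shows the scale of the resulting weights; it does not reproduce the DeepNet-style derivation.

```python
import math

def xavier_uniform_bound(fan_in: int, fan_out: int, gain: float = 1.0) -> float:
    """Half-width a of the U(-a, a) distribution used by Xavier uniform init."""
    return gain * math.sqrt(6.0 / (fan_in + fan_out))

def xavier_std(fan_in: int, fan_out: int, gain: float = 1.0) -> float:
    """Standard deviation of Xavier-initialized weights (same for uniform and normal)."""
    return gain * math.sqrt(2.0 / (fan_in + fan_out))

# The constant asked about in the thread: 2 ** -2.5 = 1 / (4 * sqrt(2)).
GAIN = 2 ** -2.5

# Hypothetical embedding size, not taken from the thread.
embed_dim = 1024

default_std = xavier_std(embed_dim, embed_dim)        # gain = 1
scaled_std = xavier_std(embed_dim, embed_dim, GAIN)   # gain = 2 ** -2.5
scaled_bound = xavier_uniform_bound(embed_dim, embed_dim, GAIN)

print(f"gain         = {GAIN:.6f}")
print(f"default std  = {default_std:.6f}")
print(f"scaled std   = {scaled_std:.6f}")
print(f"scaled bound = {scaled_bound:.6f}")
```

Under these assumptions the gain simply shrinks the weight standard deviation by a constant factor of roughly 0.177 relative to plain Xavier initialization; why that particular factor is chosen is exactly what the question asks.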


shumingma commented on May 20, 2024

RetNet uses DeepNet's derivation methods to obtain the initialization for better training stability, instead of directly re-using its derived initialization (on Post-LN transformers), because the initialization depends on the model architecture according to the theory in DeepNet.


radarFudan commented on May 20, 2024

I am also interested in this initialisation scheme. Recurrent models such as S4 and S5 seem to use different schemes. Do you have any particular explanation or heuristic for this scale?

