
Comments (10)

gante commented on June 15, 2024

@danielhanchen the inv_freq permanent buffer can be cast with model-level `.to()` casting, e.g.

from transformers import AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = model.to(device="cuda", dtype=torch.bfloat16)
print(model.model.layers[0].self_attn.rotary_emb.inv_freq.dtype)

On Llama and Gemma that's not a problem, since we recently updated the code to cast inv_freq to float() before it is used to compute sin and cos (e.g. here). However, other RoPE models like Mistral have yet to receive the same treatment.
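The upcast-before-apply pattern can be sketched as follows (a minimal illustration, not the actual transformers code): even if the buffer was downcast by `model.to(...)`, the rotary tables come out in float32.

```python
import torch

# Sketch of the upcast-before-apply pattern: cos/sin are always computed in
# float32, even if the inv_freq buffer itself was downcast to bfloat16.
def rope_cos_sin(inv_freq: torch.Tensor, positions: torch.Tensor):
    freqs = torch.outer(positions.float(), inv_freq.float())
    emb = torch.cat((freqs, freqs), dim=-1)
    return emb.cos(), emb.sin()

dim, base = 64, 10000.0
inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
# Simulate a buffer that was cast to bfloat16 by model.to(...):
cos, sin = rope_cos_sin(inv_freq.to(torch.bfloat16), torch.arange(16))
print(cos.dtype)  # torch.float32
```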

We'll gladly take PRs to fix it ;) We will be touching the other RoPE models soon anyway, to migrate them to a Llama-like structure (which, contrary to the other models, is compatible with torch.compile).

from transformers.

ArthurZucker commented on June 15, 2024

Do you want to open a PR to propagate the changes we made to Llama and Gemma?


ArthurZucker commented on June 15, 2024

cc @gante


danielhanchen commented on June 15, 2024

@avnermay I'm not too certain, but I think inv_freq will always be calculated in float32. For example, Gemma:

self.inv_freq = 1.0 / (
    self.base ** (torch.arange(0, self.dim, 2, dtype=torch.int64, device=x.device).float() / self.dim)
)

And for Llama:

inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2, dtype=torch.int64).float().to(device) / self.dim))

The downcast only applies to matrix multiplications and explicit downcasts, like what I found they did in Keras.

I haven't run the code to confirm, but it would be great if you could print the dtype during a finetuning run to confirm whether inv_freq is actually bfloat16.
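One way to check this without a full finetuning run (a hypothetical minimal module, assuming inv_freq is registered as a buffer the way Llama/Gemma do it): autocast changes the compute dtype of operations, not the dtype of stored buffers, while a module-level `.to(dtype)` cast does rewrite the buffer.

```python
import torch
import torch.nn as nn

# Hypothetical minimal rotary module for illustrating the dtype check.
class TinyRotary(nn.Module):
    def __init__(self, dim=64, base=10000.0):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.int64).float() / dim))
        self.register_buffer("inv_freq", inv_freq, persistent=False)

m = TinyRotary()
print(m.inv_freq.dtype)  # torch.float32 at init

# Mixed-precision autocast does NOT touch stored buffers:
with torch.autocast("cpu", dtype=torch.bfloat16):
    print(m.inv_freq.dtype)  # still torch.float32

# ...but an explicit module-level cast does:
m = m.to(torch.bfloat16)
print(m.inv_freq.dtype)  # torch.bfloat16
```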


danielhanchen commented on June 15, 2024

@gante Whoops sorry just saw this - apologies!

Oh, fair points on this! Hmm, is there some sort of lock-in mechanism to prevent the conversion from occurring? Maybe some kind of overriding mechanism, i.e. writing over tensor.to itself?


avnermay commented on June 15, 2024

Why not use the approach taken by the other models, which forces inv_freq to be float32? The key is avoiding cases where cos and sin are recomputed from a low-precision inv_freq tensor. This occurs, for example, during mixed-precision training, because inv_freq is automatically downcast to bfloat16 in that case.
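A quick numeric illustration of why this matters (my own sketch, not transformers code): bfloat16 rounding of the frequencies compounds with the position index, so cos/sin tables recomputed from a downcast inv_freq drift badly at large positions.

```python
import torch

dim, base = 128, 10000.0
inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
pos = torch.arange(4096).float()

# Reference table, computed entirely in float32:
cos_ref = torch.outer(pos, inv_freq).cos()
# Table recomputed after inv_freq was round-tripped through bfloat16,
# as happens when the buffer is silently downcast:
cos_bad = torch.outer(pos, inv_freq.to(torch.bfloat16).float()).cos()

# The angle error grows linearly with position, so the error is
# substantial at long context lengths:
print((cos_ref - cos_bad).abs().max())
```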


gante commented on June 15, 2024

@danielhanchen the only solution is to explicitly upcast 😬 Some frameworks like DeepSpeed can hijack tensor creation and force tensors to be initialized in a certain dtype (which has also caused issues with RoPE).

@avnermay that is the solution. The change is simple, but we are working on other overlapping problems -- bear with us 🤗


avnermay commented on June 15, 2024

Just commenting on this so that it is not marked as stale. Thanks!


ArthurZucker commented on June 15, 2024

#30642 will fix this! 🤗


github-actions commented on June 15, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

