
Comments (9)

XavierSpycy commented on June 3, 2024

I believe you received the warning because your GPU memory isn't large enough to load the entire model, which in this case is Qwen1.5-14B-Chat. So, when you let the model be placed automatically by setting device_map='auto':

model = AutoModelForCausalLM.from_pretrained(
    "/mnt/sdb/hope/work/models/Qwen1.5-14B-Chat",
    device_map="auto"
)

any parameters that cannot be loaded onto your GPU will be offloaded to your CPU. This is why you're seeing the warning:

Some parameters are on the meta device device because they were offloaded to the cpu. 

I suggest trying to set device_map to a specific device, where device is set to "cuda". For example:

# assuming you have already defined 'device' somewhere
model = AutoModelForCausalLM.from_pretrained(
    "/mnt/sdb/hope/work/models/Qwen1.5-14B-Chat",
    device_map=device  # Example of manually specifying device map
)

You might encounter a CUDA Out of Memory error. If that happens, consider using a quantized version of the model (such as Qwen/Qwen1.5-14B-Chat-GPTQ-Int8 or an AWQ variant), selecting a smaller model size (e.g., 0.5B, 4B, 7B), or upgrading to a GPU with enough memory to hold your target model.
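If you go the quantized route, the loading call stays essentially the same. A minimal sketch, assuming the GPTQ checkpoint is available and the auto-gptq/optimum packages are installed (the repo id below is the one mentioned above; adjust it to a local path if you downloaded the weights):

# minimal sketch; requires auto-gptq (or optimum) in addition to transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-14B-Chat-GPTQ-Int8",  # quantized variant suggested above
    device_map="cuda"                   # pin to the GPU instead of offloading
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-14B-Chat-GPTQ-Int8")

Int8 weights roughly halve the memory footprint compared with fp16/bf16, which may be enough to keep the 14B model on a single card.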

Hope my suggestions help.


YingchaoX commented on June 3, 2024
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.

Same question here.
Should we set the pad_token_id manually?


XavierSpycy commented on June 3, 2024

Just pass one parameter into the model's generate method like this:

generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, pad_token_id=tokenizer.eos_token_id)
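If you also want to address the attention-mask half of that warning, the mask the tokenizer already returned can be passed alongside it. A small sketch in the same spirit (not from the original reply):

generated_ids = model.generate(
    model_inputs.input_ids,
    attention_mask=model_inputs.attention_mask,  # mask produced by the tokenizer call
    max_new_tokens=512,
    pad_token_id=tokenizer.eos_token_id
)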


hopeforus commented on June 3, 2024

Thanks, I will try.


hopeforus commented on June 3, 2024

Just pass one parameter into the model's generate method like this:

generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, pad_token_id=tokenizer.eos_token_id)

I added the code as:
generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, pad_token_id=tokenizer.eos_token_id)

I got the right answer.

But I still get the same warnings:
WARNING:root:Some parameters are on the meta device device because they were offloaded to the cpu.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


XavierSpycy commented on June 3, 2024

Just pass one parameter into the model's generate method like this:

generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, pad_token_id=tokenizer.eos_token_id)

I added the code as: generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, pad_token_id=tokenizer.eos_token_id)

I got the right answer.

But I still get the same warnings: WARNING:root:Some parameters are on the meta device device because they were offloaded to the cpu. Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

Do you mind showing me how you define your model and the model inputs:

model = AutoModelForCausalLM.from_pretrained(...)
# ....
# Some other code here
# ....
model_inputs = tokenizer([text], return_tensors="pt").to(...)

By the way, I've tried some models in the Qwen1.5 series on my machine, and their behaviors might differ slightly. If it's convenient, would you mind also telling me which model you're using and some information about your GPU(s) (such as the model type and how many you have, if there's more than one)?
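For the GPU details, a quick way to collect them is a generic torch snippet (nothing Qwen-specific, just an illustration):

import torch

print("GPUs:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(i, props.name, f"{props.total_memory / 1024**3:.1f} GiB")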


hopeforus commented on June 3, 2024

Just pass one parameter into the model's generate method like this:

generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, pad_token_id=tokenizer.eos_token_id)

I added the code as: generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, pad_token_id=tokenizer.eos_token_id)
I got the right answer.
But I still get the same warnings: WARNING:root:Some parameters are on the meta device device because they were offloaded to the cpu. Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

Do you mind showing me how you define your model and the model inputs:

model = AutoModelForCausalLM.from_pretrained(...)
# ....
# Some other code here
# ....
model_inputs = tokenizer([text], return_tensors="pt").to(...)

By the way, I've tried some models in the Qwen1.5 series on my machine, and their behaviors might differ slightly. If it's convenient, would you mind also telling me which model you're using and some information about your GPU(s) (such as the model type and how many you have, if there's more than one)?

Here is the code:

from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    "/mnt/sdb/hope/work/models/Qwen1.5-14B-Chat",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("/mnt/sdb/hope/work/models/Qwen1.5-14B-Chat")

prompt = "把大象装到冰箱里面要几步?"  # "How many steps does it take to put an elephant into a fridge?"
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

# generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, pad_token_id=tokenizer.eos_token_id)
generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(response)
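One quick way to confirm whether layers really were offloaded with device_map="auto" is to inspect the device map that accelerate records on the model. A small diagnostic sketch (not part of the script above):

# run after from_pretrained(..., device_map="auto")
print(model.hf_device_map)  # maps module names to devices, e.g. 0 or "cpu"
offloaded = [name for name, dev in model.hf_device_map.items() if dev == "cpu"]
print(len(offloaded), "modules offloaded to CPU")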


hopeforus commented on June 3, 2024
device_map={0: device}

I particularly appreciate your patience. I have revised the code according to your advice. Attached is my GPU status report; the issue might be related to memory. The warning output is also provided below. The model runs normally, but inference is slow, and these warnings appear after several minutes. Thank you once again!

WARNING:root:Some parameters are on the meta device device because they were offloaded to the cpu.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:151645 for open-end generation.

[Screenshot: GPU status, 2024-02-09 10:09:24]
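As an aside on device_map={0: device}: when device_map is given as a dict, it is keyed by module names rather than device indices, so pinning the whole model to one GPU usually looks like one of the following. This is a sketch, not something from the thread; model_path here just stands for the local checkpoint directory:

# option 1: a plain device string
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="cuda:0")

# option 2: a dict keyed by module name, where "" means the whole model
model = AutoModelForCausalLM.from_pretrained(model_path, device_map={"": 0})

With the model forced onto one GPU, a checkpoint that doesn't fit raises a CUDA out-of-memory error instead of being silently offloaded.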


hopeforus commented on June 3, 2024

I believe you received the warning because your GPU memory isn't large enough to load the entire model, which in this case is Qwen1.5-14B-Chat. So, when you let the model be placed automatically by setting device_map='auto':

model = AutoModelForCausalLM.from_pretrained(
    "/mnt/sdb/hope/work/models/Qwen1.5-14B-Chat",
    device_map="auto"
)

any parameters that cannot be loaded onto your GPU will be offloaded to your CPU. This is why you're seeing the warning:

Some parameters are on the meta device device because they were offloaded to the cpu. 

I suggest trying to set device_map to a specific device, where device is set to "cuda". For example:

# assuming you have already defined 'device' somewhere
model = AutoModelForCausalLM.from_pretrained(
    "/mnt/sdb/hope/work/models/Qwen1.5-14B-Chat",
    device_map=device  # Example of manually specifying device map
)

You might encounter a CUDA Out of Memory error. If that happens, consider using a quantized version of the model (such as Qwen/Qwen1.5-14B-Chat-GPTQ-Int8 or an AWQ variant), selecting a smaller model size (e.g., 0.5B, 4B, 7B), or upgrading to a GPU with enough memory to hold your target model.

Hope my suggestions help.

I tried the 7B model and everything is fine now. It does seem to be a problem with my GPU memory. Is the new 14B version more memory-intensive because of the increased context length? The previous version ran on this GPU without any issues.
Thanks a lot!

