Comments (9)
I believe you received the warning because your GPU memory isn't large enough to hold the entire model, which in this case is Qwen1.5-14B-Chat. So when you let the model be assigned automatically by setting `device_map='auto'`:
model = AutoModelForCausalLM.from_pretrained(
    "/mnt/sdb/hope/work/models/Qwen1.5-14B-Chat",
    device_map="auto"
)
any parameters that cannot be loaded onto your GPU will be offloaded to your CPU. This is why you're seeing the warning:
Some parameters are on the meta device device because they were offloaded to the cpu.
I suggest setting `device_map` to a specific device, with `device` set to `"cuda"`. For example:
# assuming you have already defined 'device' somewhere
model = AutoModelForCausalLM.from_pretrained(
    "/mnt/sdb/hope/work/models/Qwen1.5-14B-Chat",
    device_map=device  # manually pin the model to that device
)
You might encounter a CUDA out-of-memory error. If that happens, consider using a quantized version of the model (such as Qwen/Qwen1.5-14B-Chat-GPTQ-Int8, or the GPTQ-Int4/AWQ variants), selecting a smaller model size (e.g., 0.5B, 4B, or 7B), or upgrading to a GPU with enough memory to hold your target model.
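To see why a 14B model overflows a typical consumer GPU, here is a back-of-envelope estimate of the weight memory alone. This is a rough sketch: it ignores activations and the KV cache, which add more on top.

```python
# Rough GPU memory needed just for the weights of an N-parameter model.
def weight_memory_gib(n_params: float, bytes_per_param: float) -> float:
    """Weights-only footprint in GiB; activations and KV cache come on top."""
    return n_params * bytes_per_param / 1024**3

n = 14e9  # Qwen1.5-14B: roughly 14 billion parameters
print(round(weight_memory_gib(n, 2.0), 1))  # fp16/bf16: ~26.1 GiB
print(round(weight_memory_gib(n, 1.0), 1))  # Int8 quantized: ~13.0 GiB
print(round(weight_memory_gib(n, 0.5), 1))  # Int4 quantized: ~6.5 GiB
```

So in fp16 the 14B weights alone exceed a 24 GB card, which is exactly the situation where `device_map="auto"` starts offloading layers to the CPU.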
Hope my suggestions help.
from qwen1.5.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
Same question here: should we set `pad_token_id` manually?
Just pass one extra parameter to the model's `generate` method, like this:
generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, pad_token_id=tokenizer.eos_token_id)
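For background on why the warning mentions both the attention mask and the pad token: when sequences in a batch have different lengths, the shorter ones are padded, and the mask tells the model which positions are real. A toy illustration of left padding (no transformers needed; the `pad_batch` helper is made up for this sketch):

```python
def pad_batch(seqs, pad_id):
    """Left-pad token-id sequences to equal length and build the matching
    attention mask: 1 = real token, 0 = padding to be ignored."""
    width = max(len(s) for s in seqs)
    input_ids, attention_mask = [], []
    for s in seqs:
        pad = [pad_id] * (width - len(s))
        input_ids.append(pad + s)  # decoder-only LMs are padded on the left
        attention_mask.append([0] * len(pad) + [1] * len(s))
    return input_ids, attention_mask

ids, mask = pad_batch([[5, 6, 7], [8]], pad_id=0)
print(ids)   # [[5, 6, 7], [0, 0, 8]]
print(mask)  # [[1, 1, 1], [0, 0, 1]]
```

With a single prompt, as in this thread, there is no padding at all, which is why passing `pad_token_id=tokenizer.eos_token_id` only silences the warning rather than changing the output.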
Thanks, I will try.
I added the code as:
generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, pad_token_id=tokenizer.eos_token_id)
and I got the right answer, but I still see the same warnings:
WARNING:root:Some parameters are on the meta device device because they were offloaded to the cpu.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Do you mind showing me how you define your model and the model inputs:
model = AutoModelForCausalLM.from_pretrained(...)
# ....
# Some other code here
# ....
model_inputs = tokenizer([text], return_tensors="pt").to(...)
By the way, I've tried some models in the Qwen1.5 series on my machine, and their behaviors may differ slightly. If it's convenient, could you also tell me which model you're using and some information about your GPU(s) (the model type, and how many you have if more than one)?
Here is the code:

from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    "/mnt/sdb/hope/work/models/Qwen1.5-14B-Chat",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("/mnt/sdb/hope/work/models/Qwen1.5-14B-Chat")

prompt = "把大象装到冰箱里面要几步?"  # "How many steps does it take to put an elephant in the fridge?"
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

# generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, pad_token_id=tokenizer.eos_token_id)
generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
Try setting `device_map={0: device}`.
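For reference, `device_map` also accepts an explicit dict instead of `"auto"`. A minimal sketch, assuming a single GPU; the `from_pretrained` call is commented out because it needs the local checkpoint from earlier in the thread:

```python
# An explicit device_map pins modules to devices yourself instead of letting
# accelerate split the model (the splitting is what triggers CPU offload
# under "auto"). The key "" means "the whole model"; {0: device} above
# expresses the same intent with a different key convention.
device = "cuda:0"
explicit_map = {"": device}

# model = AutoModelForCausalLM.from_pretrained(
#     "/mnt/sdb/hope/work/models/Qwen1.5-14B-Chat",
#     device_map=explicit_map,
# )
```

Note that pinning everything to one GPU only works when the weights actually fit; otherwise loading fails with an out-of-memory error instead of silently offloading.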
I really appreciate your patience. I have revised the code according to your advice. Attached is my GPU status report; the issue might indeed be memory-related. The warning output is included below. The model runs normally, but inference is slow, and these warnings appear after several minutes. Thank you once again!
WARNING:root:Some parameters are on the meta device device because they were offloaded to the cpu.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
I tried the 7B model and everything is fine now, so it does seem to be a problem with my GPU memory. The new 14B version is more memory-intensive; is that because of the increased context length? The previous version ran on my GPU with no issues at all.
Thanks a lot!