
Comments (9)

XavierSpycy commented on June 3, 2024

I believe you received the warning because your GPU memory isn't large enough to load the entire model, which in this case is Qwen1.5-14B-Chat. So, when you let the model be placed automatically by setting device_map='auto':

model = AutoModelForCausalLM.from_pretrained(
    "/mnt/sdb/hope/work/models/Qwen1.5-14B-Chat",
    device_map="auto"
)

any parameters that cannot be loaded onto your GPU will be offloaded to your CPU. This is why you're seeing the warning:

Some parameters are on the meta device device because they were offloaded to the cpu. 

I suggest trying to set device_map to a specific device, where device is set to "cuda". For example:

# assuming you have already defined 'device' somewhere
model = AutoModelForCausalLM.from_pretrained(
    "/mnt/sdb/hope/work/models/Qwen1.5-14B-Chat",
    device_map=device  # Example of manually specifying device map
)

You might encounter a CUDA Out of Memory error. If that happens, consider using a quantized version of the model (such as Qwen/Qwen1.5-14B-Chat-GPTQ-Int8 or an AWQ variant), selecting a smaller model size (e.g., 0.5B, 4B, 7B), or upgrading to a GPU with enough memory to hold your target model.
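If you go the quantized route, the loading call stays essentially the same. A minimal sketch, assuming the GPTQ checkpoint is available and the auto-gptq/optimum packages are installed (the repo id below is the one mentioned above; adjust it to a local path if you downloaded the weights):

# minimal sketch; requires auto-gptq (or optimum) in addition to transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-14B-Chat-GPTQ-Int8",  # quantized variant suggested above
    device_map="cuda"                   # pin to the GPU instead of offloading
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-14B-Chat-GPTQ-Int8")

Int8 weights roughly halve the memory footprint compared with fp16/bf16, which may be enough to keep the 14B model on a single card.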

Hope my suggestions help.


YingchaoX commented on June 3, 2024
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.

Same question here.
Should we set the pad_token_id manually?


XavierSpycy commented on June 3, 2024

Just pass one parameter into the model's generate method like this:

generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, pad_token_id=tokenizer.eos_token_id)
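If you also want to address the attention-mask half of that warning, the mask the tokenizer already returned can be passed alongside it. A small sketch in the same spirit (not from the original reply):

generated_ids = model.generate(
    model_inputs.input_ids,
    attention_mask=model_inputs.attention_mask,  # mask produced by the tokenizer call
    max_new_tokens=512,
    pad_token_id=tokenizer.eos_token_id
)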


hopeforus commented on June 3, 2024

Thanks, I will try.


hopeforus commented on June 3, 2024

Just pass one parameter into the model's generate method like this:

generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, pad_token_id=tokenizer.eos_token_id)

I added the code as:
generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, pad_token_id=tokenizer.eos_token_id)

I got the right answer.

But I still get the same warnings:
WARNING:root:Some parameters are on the meta device device because they were offloaded to the cpu.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


XavierSpycy commented on June 3, 2024

Just pass one parameter into the model's generate method like this:

generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, pad_token_id=tokenizer.eos_token_id)

I added the code as: generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, pad_token_id=tokenizer.eos_token_id)

I got the right answer.

But I still get the same warnings: WARNING:root:Some parameters are on the meta device device because they were offloaded to the cpu. Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

Do you mind showing me how you define your model and the model inputs:

model = AutoModelForCausalLM.from_pretrained(...)
# ....
# Some other code here
# ....
model_inputs = tokenizer([text], return_tensors="pt").to(...)

By the way, I've tried some models in the Qwen1.5 series on my machine, and their behaviors might differ slightly. If it's convenient, would you mind also telling me which model you're using and some information about your GPU(s) (such as the model type and how many you have, if there's more than one)?
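For the GPU details, a quick way to collect them is a generic torch snippet (nothing Qwen-specific, just an illustration):

import torch

print("GPUs:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(i, props.name, f"{props.total_memory / 1024**3:.1f} GiB")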


hopeforus commented on June 3, 2024

Just pass one parameter into the model's generate method like this:

generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, pad_token_id=tokenizer.eos_token_id)

I added the code as: generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, pad_token_id=tokenizer.eos_token_id)
I got the right answer.
But I still get the same warnings: WARNING:root:Some parameters are on the meta device device because they were offloaded to the cpu. Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

Do you mind showing me how you define your model and the model inputs:

model = AutoModelForCausalLM.from_pretrained(...)
# ....
# Some other code here
# ....
model_inputs = tokenizer([text], return_tensors="pt").to(...)

By the way, I've tried some models in the Qwen1.5 series on my machine, and their behaviors might differ slightly. If it's convenient, would you mind also telling me which model you're using and some information about your GPU(s) (such as the model type and how many you have, if there's more than one)?

Here is the code:

from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    "/mnt/sdb/hope/work/models/Qwen1.5-14B-Chat",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("/mnt/sdb/hope/work/models/Qwen1.5-14B-Chat")

prompt = "把大象装到冰箱里面要几步?"  # "How many steps does it take to put an elephant into a fridge?"
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

# generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, pad_token_id=tokenizer.eos_token_id)
generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(response)
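One quick way to confirm whether layers really were offloaded with device_map="auto" is to inspect the device map that accelerate records on the model. A small diagnostic sketch (not part of the script above):

# run after from_pretrained(..., device_map="auto")
print(model.hf_device_map)  # maps module names to devices, e.g. 0 or "cpu"
offloaded = [name for name, dev in model.hf_device_map.items() if dev == "cpu"]
print(len(offloaded), "modules offloaded to CPU")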


hopeforus commented on June 3, 2024
device_map={0: device}

I particularly appreciate your patience. I have revised the code according to your advice. Attached is my GPU status report; the issue might be related to memory. The warning output is also provided below. The model runs normally, but inference is slow, and these warnings appear after several minutes. Thank you once again!

WARNING:root:Some parameters are on the meta device device because they were offloaded to the cpu.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:151645 for open-end generation.

[Screenshot: GPU status, 2024-02-09 10:09:24]
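As an aside on device_map={0: device}: when device_map is given as a dict, it is keyed by module names rather than device indices, so pinning the whole model to one GPU usually looks like one of the following. This is a sketch, not something from the thread; model_path here just stands for the local checkpoint directory:

# option 1: a plain device string
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="cuda:0")

# option 2: a dict keyed by module name, where "" means the whole model
model = AutoModelForCausalLM.from_pretrained(model_path, device_map={"": 0})

With the model forced onto one GPU, a checkpoint that doesn't fit raises a CUDA out-of-memory error instead of being silently offloaded.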


hopeforus commented on June 3, 2024

I believe you received the warning because your GPU memory isn't large enough to load the entire model, which in this case is Qwen1.5-14B-Chat. So, when you let the model be placed automatically by setting device_map='auto':

model = AutoModelForCausalLM.from_pretrained(
    "/mnt/sdb/hope/work/models/Qwen1.5-14B-Chat",
    device_map="auto"
)

any parameters that cannot be loaded onto your GPU will be offloaded to your CPU. This is why you're seeing the warning:

Some parameters are on the meta device device because they were offloaded to the cpu. 

I suggest trying to set device_map to a specific device, where device is set to "cuda". For example:

# assuming you have already defined 'device' somewhere
model = AutoModelForCausalLM.from_pretrained(
    "/mnt/sdb/hope/work/models/Qwen1.5-14B-Chat",
    device_map=device  # Example of manually specifying device map
)

You might encounter a CUDA Out of Memory error. If that happens, consider using a quantized version of the model (such as Qwen/Qwen1.5-14B-Chat-GPTQ-Int8 or an AWQ variant), selecting a smaller model size (e.g., 0.5B, 4B, 7B), or upgrading to a GPU with enough memory to hold your target model.

Hope my suggestions help.

I tried the 7B model and everything is fine now. It does seem to be a problem with my GPU memory. Is the new 14B version more memory-intensive because of the increased context length? The previous version ran on this GPU without any issues.
Thanks a lot!

