
qwen1.5's People

Contributors

bug-orz, eltociear, fyabc, ganeshkrishnan1, haiasd, huybery, hzhwcmhf, jianxinma, jklj077, justinlin610, jxst539246, michaelvll, openvino-dev-contest, osanseviero, tpoisonooo, tuhahaha, yangapku, yijia2413


qwen1.5's Issues

Running the API service raises: Tokenizer class Qwen2Tokenizer does not exist or is not currently imported.

Running the API service with openai_api.py fails with an error.
The model used is Qwen1.5-14B-Chat-GPTQ-Int8.
python3 openai_api.py --checkpoint-path /root/Qwen1.5-14B-Chat-GPTQ-Int8 --server-port 18000

Traceback (most recent call last):
File "openai_api.py", line 574, in
tokenizer = AutoTokenizer.from_pretrained(
File "/usr/local/lib/python3.8/dist-packages/transformers/models/auto/tokenization_auto.py", line 724, in from_pretrained
raise ValueError(
ValueError: Tokenizer class Qwen2Tokenizer does not exist or is not currently imported.

It reports that Qwen2Tokenizer does not exist. How can this be resolved? Thanks.
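This error usually means the installed transformers predates Qwen2 support, which was added in transformers 4.37.0 (the version recorded in the model's config.json). A minimal check, assuming a pip-managed environment and the checkpoint path from the command above:

    # Minimal sketch: verify that the installed transformers knows Qwen2Tokenizer.
    # Qwen2 support landed in transformers 4.37.0; older versions raise the error above.
    import transformers
    print(transformers.__version__)  # should be >= 4.37.0

    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained("/root/Qwen1.5-14B-Chat-GPTQ-Int8")
    print(type(tokenizer).__name__)  # Qwen2TokenizerFast once the upgrade is in place

If the printed version is older, upgrading with pip install -U "transformers>=4.37.0" should resolve it.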

vLLM deployment failed

**Deploy platform:** RTX3090-24G
**Model:** Qwen1.5-7B-Chat-GPTQ-Int4
Command:
CUDA_VISIBLE_DEVICES=7 python -m vllm.entrypoints.openai.api_server --model qwen/Qwen1.5-7B-Chat-GPTQ-Int4 --served-model-name qwen/Qwen1.5-7B-Chat-GPTQ-Int4 --quantization gptq

error:
ValueError: The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (24320). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.

Does vLLM not support Qwen1.5 yet?
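The error message itself points at the fix: the 24 GB card cannot hold a KV cache for the full 32768-token context, so either lower max_model_len or raise gpu_memory_utilization when starting the engine (CLI flags --max-model-len / --gpu-memory-utilization). A minimal sketch with the vLLM Python API; the values here are illustrative, not required settings:

    # Minimal sketch: cap the context length so the KV cache fits on a 24 GB GPU.
    from vllm import LLM

    llm = LLM(
        model="qwen/Qwen1.5-7B-Chat-GPTQ-Int4",
        quantization="gptq",
        max_model_len=8192,            # below the 32768 default read from config.json
        gpu_memory_utilization=0.95,   # or raise this instead of (or as well as) the cap
    )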

起始日期 | Start Date

No response

实现PR | Implementation PR

Does vLLM not support Qwen2ForCausalLM yet?

相关Issues | Reference Issues

No response

摘要 | Summary

Does vLLM not support Qwen2ForCausalLM yet?

基本示例 | Basic Example

Does vLLM not support Qwen2ForCausalLM yet?

缺陷 | Drawbacks

Does vLLM not support Qwen2ForCausalLM yet?

未解决问题 | Unresolved questions

No response

Qwen1.5-72B-Chat uses 270 GB of GPU memory and inference runs at only 3 tokens/s

是否已有关于该错误的issue或讨论? | Is there an existing issue / discussion for this?

  • 我已经搜索过已有的issues和讨论 | I have searched the existing issues / discussions

该问题是否在FAQ中有解答? | Is there an existing answer for this in FAQ?

  • 我已经搜索过FAQ | I have searched FAQ

当前行为 | Current Behavior

Qwen1.5-72B-Chat uses twice the GPU memory of Qwen-72B-Chat and runs at only half its speed. The model config specifies "torch_dtype": "bfloat16", but the dtype printed after loading is float32.
Model page: https://modelscope.cn/models/qwen/Qwen1.5-72B-Chat/files

Config file: https://modelscope.cn/models/qwen/Qwen1.5-72B-Chat/file/view/master/config.json?status=1
{
"architectures": [
"Qwen2ForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151643,
"hidden_act": "silu",
"hidden_size": 8192,
"initializer_range": 0.02,
"intermediate_size": 24576,
"max_position_embeddings": 32768,
"max_window_layers": 70,
"model_type": "qwen2",
"num_attention_heads": 64,
"num_hidden_layers": 80,
"num_key_value_heads": 64,
"rms_norm_eps": 1e-06,
"rope_theta": 1000000.0,
"sliding_window": 32768,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.37.0",
"use_cache": true,
"use_sliding_window": false,
"vocab_size": 152064
}
After the model finishes loading, the printed parameter dtype is float32 (screenshot omitted).

期望行为 | Expected Behavior

No response

复现方法 | Steps To Reproduce

# chat_demo_72B-1.5.py
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import os, time
from tqdm import tqdm
import pdb

device = "cuda"
model_path = "/data/shared/Qwen1.5/Qwen1.5-72B-Chat"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", use_cache=True)

total_memory_allocated = 0
device_count = torch.cuda.device_count()
for device_id in range(device_count):
    torch.cuda.set_device(device_id)
    total_memory_allocated += torch.cuda.memory_allocated(device_id)
total_memory_allocated_gb = total_memory_allocated / (1024**3)
print(f"\nTotal memory allocated across all visible CUDA devices: {total_memory_allocated_gb} GB\n")

tokenizer = AutoTokenizer.from_pretrained(model_path)
messages = [
    {"role": "system", "content": "你是一个有用的助手。"},
]

while True:
    user_input = input("\nUser: ")
    if user_input == "quit":
        print("exit")
        break
    messages.append({"role": "user", "content": user_input})
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(device)
    pbar = tqdm(total=1, desc="生成中", leave=True)
    start_time = time.time()
    generated_ids = model.generate(
        model_inputs.input_ids,
        max_new_tokens=512,
        pad_token_id=tokenizer.eos_token_id
    )
    end_time = time.time()
    generation_time = end_time - start_time
    pbar.update(1)
    pbar.close()
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]

    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

    print("\nQwen-Chat: ", response)
    messages.append({"role": "system", "content": response})
    total_generated_tokens = sum(len(ids) for ids in generated_ids)
    total_characters = len(response)
    chars_per_second = total_characters / generation_time
    tokens_per_second = total_generated_tokens / generation_time
    print(f"Response length: {len(response)} characters")
    print(f"Response time: {generation_time:.2f}s")
    print(f"Characters per second: {chars_per_second:.2f}")
    print(f"Tokens per second: {tokens_per_second:.2f}")

运行环境 | Environment

- OS:Alibaba Cloud Linux release 3 (Soaring Falcon)
- Python:Python 3.8.10
- Transformers:4.37.2
- PyTorch:2.0.1
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):11.7

备注 | Anything else?

No response

Has anyone deployed qwen1.5 and gotten chat to work normally?

With the api/web deployment via FastChat, both chat and completion end early, as if generation has not finished; the same happens when using completion after apply_chat_template.

With vllm.entrypoints.openai.api_server, the output contains many \n characters. I tried the methods from the issues but none has worked so far, and using \n directly as a stop string trades one problem for another. qwen/Qwen1.5-14B-Chat does not show this behavior, qwen/Qwen1.5-14B-Chat-GPTQ-Int4 shows it occasionally, qwen/Qwen1.5-72B-Chat-GPTQ-Int4 shows it severely, and 4B-Chat and its GPTQ-Int4 variant show it severely as well.

messages = [{"role": "user", "content": "你是什么模型?"}]
response = openai.ChatCompletion.create(
    model=model_id,
    messages=messages,
    top_k=50,
    top_p=0.7,
    repetition_penalty=1,
    temperature=0.7,
    max_tokens=512,
    stream=True
)

Who knows what GQA and SWA refer to?

These two terms appear in the README.md:

We have not integrated GQA and mixture of SWA and full attention in this version and we will add the features in the future version.

How to enable flash attention 2?

Wonderful work!
I wonder whether you trained the model with FlashAttention 2 or not, and how to enable FlashAttention 2.
Have you tested whether FlashAttention 2 degrades model performance?
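With transformers 4.37+, FlashAttention-2 is selected per model at load time rather than baked into the weights; a minimal sketch, assuming the flash-attn package is installed and the GPU supports it (half precision only):

    # Minimal sketch: request the FlashAttention-2 kernels when loading the model.
    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen1.5-7B-Chat",
        torch_dtype=torch.bfloat16,                 # FA2 requires fp16/bf16
        attn_implementation="flash_attention_2",
        device_map="auto",
    )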

Multi-GPU DPO training raises an error: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.HalfTensor instead (while checking arguments for embedding)

Traceback (most recent call last):
File "Sakura_DPO_qwen.py", line 339, in
fire.Fire(train)
File "/root/miniconda3/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/root/miniconda3/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/root/miniconda3/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "Sakura_DPO_qwen.py", line 329, in train
dpo_trainer.train()
File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 1539, in train
return inner_training_loop(
File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 1869, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 2772, in training_step
loss = self.compute_loss(model, inputs)
File "/root/miniconda3/lib/python3.8/site-packages/trl/trainer/dpo_trainer.py", line 1055, in compute_loss
loss, metrics = self.get_batch_loss_metrics(model, inputs, train_eval="train")
File "/root/miniconda3/lib/python3.8/site-packages/trl/trainer/dpo_trainer.py", line 996, in get_batch_loss_metrics
) = self.concatenated_forward(model, batch)
File "/root/miniconda3/lib/python3.8/site-packages/trl/trainer/dpo_trainer.py", line 960, in concatenated_forward
all_logits = model(
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1852, in forward
loss = self.module(*inputs, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/peft/peft_model.py", line 1083, in forward
return self.base_model(
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/peft/tuners/tuners_utils.py", line 161, in forward
return self.model.forward(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1173, in forward
outputs = self.model(
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1003, in forward
inputs_embeds = self.embed_tokens(input_ids)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
result = forward_call(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 162, in forward
return F.embedding(
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/functional.py", line 2210, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.HalfTensor instead (while checking arguments for embedding)
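The traceback shows input_ids reaching F.embedding as a torch.cuda.HalfTensor; embedding lookups require integer indices, so somewhere in the DPO pipeline the ids are being cast to fp16 (typically by mixed-precision handling of the batch). A minimal sketch of the constraint with hypothetical tensors:

    # Minimal sketch: nn.Embedding only accepts Long/Int indices; a HalfTensor
    # triggers exactly the RuntimeError above. Casting back to long fixes the lookup.
    import torch

    emb = torch.nn.Embedding(10, 4)
    ids = torch.tensor([1, 2, 3], dtype=torch.float16)  # what the trace shows arriving
    ids = ids.long()                                     # the dtype embedding expects
    print(emb(ids).shape)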

When serving the model with vllm.entrypoints.openai.api_server, calling the API returns incorrect output

vLLM
We advise you to use vLLM>=0.3.0 to build OpenAI-compatible API service. Start the server with a chat model, e.g. Qwen1.5-7B-Chat:

python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen1.5-7B-Chat

Then use the chat API as demonstrated below:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "Qwen/Qwen1.5-7B-Chat",
    "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "你是谁?"}
    ]
    }'

The output ends with a long run of '\n\n\n'.

Adding the parameter "add_generation_prompt": true to the API request resolved the problem.
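For reference, the same request expressed in Python with the parameter the fix above adds (assuming the vLLM OpenAI-compatible server passes extra fields such as add_generation_prompt through to the chat template):

    # Minimal sketch of the chat call with add_generation_prompt set explicitly.
    import requests

    resp = requests.post(
        "http://localhost:8000/v1/chat/completions",
        json={
            "model": "Qwen/Qwen1.5-7B-Chat",
            "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "你是谁?"},
            ],
            "add_generation_prompt": True,  # the parameter the reporter added
        },
    )
    print(resp.json()["choices"][0]["message"]["content"])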

AttributeError: 'Qwen2ForCausalLM' object has no attribute 'chat_stream'

是否已有关于该错误的issue或讨论? | Is there an existing issue / discussion for this?

  • 我已经搜索过已有的issues和讨论 | I have searched the existing issues / discussions

该问题是否在FAQ中有解答? | Is there an existing answer for this in FAQ?

  • 我已经搜索过FAQ | I have searched FAQ

当前行为 | Current Behavior

When trying to chat with the Qwen1.5-7B-Chat-GPTQ-Int8 model, I ran into the following:
(Welcome to the Qwen-Chat model. Type a message to chat; :h shows the command help.)

Note: This demo is governed by the original license of Qwen.
We strongly advise users not to knowingly generate or allow others to knowingly generate harmful content, including hate speech, violence, pornography, deception, etc.
(注:本演示受Qwen的许可协议限制。我们强烈建议,用户不应传播及不应允许他人传播以下内容,包括但不限于仇恨言论、暴力、色情、欺诈相关的有害信息。)

User> 自我介绍一下?
Traceback (most recent call last):
File "D:\work\AI\HK_Integration\Qwen\cli_demo.py", line 210, in
main()
File "D:\work\AI\HK_Integration\Qwen\cli_demo.py", line 198, in main
for response in model.chat_stream(tokenizer, query, history=history, generation_config=config):
File "D:\software\Anaconda3\envs\HK_Integration_env\lib\site-packages\torch\nn\modules\module.py", line 1614, in getattr
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'Qwen2ForCausalLM' object has no attribute 'chat_stream'
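Qwen2ForCausalLM is a standard transformers model, so the Qwen1-style chat_stream helper is gone; streaming now goes through the regular generate API. A minimal sketch using TextIteratorStreamer, with the model id from the report:

    # Minimal sketch: stream tokens with the standard transformers API instead of chat_stream().
    from threading import Thread
    from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

    model_id = "Qwen/Qwen1.5-7B-Chat-GPTQ-Int8"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    messages = [{"role": "user", "content": "自我介绍一下?"}]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer([text], return_tensors="pt").to(model.device)

    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    Thread(target=model.generate, kwargs=dict(**inputs, streamer=streamer, max_new_tokens=512)).start()
    for piece in streamer:
        print(piece, end="", flush=True)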

期望行为 | Expected Behavior

.

复现方法 | Steps To Reproduce

With all dependencies installed, run cli_demo.py.

运行环境 | Environment

- OS:windows10
- Python:3.8
- Transformers:4.37.0
- PyTorch:2.0.1
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):11.8

备注 | Anything else?

.

Following the official quick start, the model loads fine but generation produces warnings/errors

WARNING:root:Some parameters are on the meta device device because they were offloaded to the cpu.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:151645 for open-end generation.

Does the chat model no longer provide a chat method?

AttributeError: 'Qwen2ForCausalLM' object has no attribute 'chat'

Previously, for convenience and compatibility with other models, we called the chat method, but qwen1.5 now reports there is no chat method?
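Qwen1.5 checkpoints are plain Qwen2ForCausalLM models, so the Qwen1-era chat() helper is not available; the documented flow is apply_chat_template plus generate. A minimal sketch of the equivalent call:

    # Minimal sketch of the replacement for model.chat(tokenizer, query, history=...).
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "Qwen/Qwen1.5-7B-Chat"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

    messages = [{"role": "user", "content": "自我介绍一下?"}]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=512)
    response = tokenizer.batch_decode(output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
    print(response)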

Stuck when generating long text/生成长文本时卡在生成generate_ids步骤

When I input a long prompt to the model, it gets stuck at the generation step. GPU usage rises briefly and then falls back, the memory occupied by the tokens is not released, and it simply stays stuck there. From the code, it appears to be stuck at this step:
generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
The problem occurs once the prompt reaches roughly 400~500 tokens; short texts (under about 200 tokens) generate normally. I reproduced it on both the official AWQ quantized model and my own fine-tuned model.

Running the example code produces the following messages; how should they be handled?

2024-02-07 06:36:08,182 - modelscope - INFO - PyTorch version 1.12.1 Found.
2024-02-07 06:36:08,183 - modelscope - INFO - Loading ast index from /root/.cache/modelscope/ast_indexer
2024-02-07 06:36:08,221 - modelscope - INFO - Loading done! Current index file version is 1.12.0, with md5 155281490ae47dfe7d8f4ba91b079bfc and a total number of 964 components indexed
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:151645 for open-end generation.
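The messages above are warnings rather than errors: generation proceeds, but generate() is guessing the attention mask and pad token. A minimal sketch that passes both explicitly (the model id is an assumption; substitute the checkpoint from the example code):

    # Minimal sketch: silence the warnings by passing attention_mask and pad_token_id.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "Qwen/Qwen1.5-7B-Chat"   # assumption: the quick-start chat model
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

    inputs = tokenizer(["你好,请介绍一下你自己。"], return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,                              # includes input_ids and attention_mask
        pad_token_id=tokenizer.eos_token_id,   # avoids the open-end generation notice
        max_new_tokens=128,
    )
    print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])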

KeyError: 'qwen2' even though transformers-4.37.2 and vllm were successfully installed

Using a ModelScope Jupyter notebook:
!pip install modelscope
import os
os.environ['VLLM_USE_MODELSCOPE'] = 'True'
!pip install vllm
!pip install --upgrade transformers==4.37.2

  ms-swift 1.5.1 requires transformers<4.37,>=4.33, but you have transformers 4.37.2 which is incompatible.
  Successfully installed transformers-4.37.2

from vllm import LLM
llm = LLM(model="Qwen/Qwen1.5-0.5B",trust_remote_code=True,gpu_memory_utilization=0.95) # Create an LLM.
prompts='叶文杰在三体里的角色'
outputs = llm.generate(prompts)

The qwen model was downloaded from ModelScope, then I get the following error:

KeyError Traceback (most recent call last)
Cell In[21], line 2
1 from vllm import LLM
----> 2 llm = LLM(model="Qwen/Qwen1.5-0.5B",trust_remote_code=True,gpu_memory_utilization=0.95) # Create an LLM.
3 prompts='叶文杰在三体里的角色'
4 outputs = llm.generate(prompts) # Generate texts from the prompts.

File /opt/conda/lib/python3.10/site-packages/vllm/entrypoints/llm.py:105, in LLM.__init__(self, model, tokenizer, tokenizer_mode, trust_remote_code, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, enforce_eager, max_context_len_to_capture, **kwargs)
87 kwargs["disable_log_stats"] = True
88 engine_args = EngineArgs(
89 model=model,
90 tokenizer=tokenizer,
(...)
103 **kwargs,
104 )
--> 105 self.llm_engine = LLMEngine.from_engine_args(engine_args)
106 self.request_counter = Counter()

File /opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py:304, in LLMEngine.from_engine_args(cls, engine_args)
302 """Creates an LLM engine from the engine arguments."""
303 # Create the engine configs.
--> 304 engine_configs = engine_args.create_engine_configs()
305 parallel_config = engine_configs[2]
306 # Initialize the cluster.

File /opt/conda/lib/python3.10/site-packages/vllm/engine/arg_utils.py:218, in EngineArgs.create_engine_configs(self)
215 def create_engine_configs(
216 self,
217 ) -> Tuple[ModelConfig, CacheConfig, ParallelConfig, SchedulerConfig]:
--> 218 model_config = ModelConfig(self.model, self.tokenizer,
219 self.tokenizer_mode, self.trust_remote_code,
220 self.download_dir, self.load_format,
221 self.dtype, self.seed, self.revision,
222 self.tokenizer_revision, self.max_model_len,
223 self.quantization, self.enforce_eager,
224 self.max_context_len_to_capture)
225 cache_config = CacheConfig(self.block_size,
226 self.gpu_memory_utilization,
227 self.swap_space,
228 model_config.get_sliding_window())
229 parallel_config = ParallelConfig(self.pipeline_parallel_size,
230 self.tensor_parallel_size,
231 self.worker_use_ray,
232 self.max_parallel_loading_workers)

File /opt/conda/lib/python3.10/site-packages/vllm/config.py:101, in ModelConfig.__init__(self, model, tokenizer, tokenizer_mode, trust_remote_code, download_dir, load_format, dtype, seed, revision, tokenizer_revision, max_model_len, quantization, enforce_eager, max_context_len_to_capture)
98 self.download_dir = model_path
99 self.tokenizer = model_path
--> 101 self.hf_config = get_config(self.model, trust_remote_code, revision)
102 self.dtype = _get_and_verify_dtype(self.hf_config, dtype)
103 self.max_model_len = _get_and_verify_max_len(self.hf_config,
104 max_model_len)

File /opt/conda/lib/python3.10/site-packages/vllm/transformers_utils/config.py:23, in get_config(model, trust_remote_code, revision)
19 def get_config(model: str,
20 trust_remote_code: bool,
21 revision: Optional[str] = None) -> PretrainedConfig:
22 try:
---> 23 config = AutoConfig.from_pretrained(
24 model, trust_remote_code=trust_remote_code, revision=revision)
25 except ValueError as e:
26 if (not trust_remote_code and
27 "requires you to execute the configuration file" in str(e)):

File /opt/conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py:1098, in AutoConfig.from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
1096 return config_class.from_pretrained(pretrained_model_name_or_path, **kwargs)
1097 elif "model_type" in config_dict:
-> 1098 config_class = CONFIG_MAPPING[config_dict["model_type"]]
1099 return config_class.from_dict(config_dict, **unused_kwargs)
1100 else:
1101 # Fallback: use pattern matching on the string.
1102 # We go from longer names to shorter names to catch roberta before bert (for instance)

File /opt/conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py:795, in _LazyConfigMapping.__getitem__(self, key)
793 return self._extra_content[key]
794 if key not in self._mapping:
--> 795 raise KeyError(key)
796 value = self._mapping[key]
797 module_name = model_type_to_module_name(key)

KeyError: 'qwen2'

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01 Driver Version: 470.103.01 CUDA Version: 12.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A10 On | 00000000:00:07.0 Off | 0 |
| 0% 30C P8 16W / 150W | 0MiB / 22731MiB | 0% Default |
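KeyError: 'qwen2' is raised from transformers' CONFIG_MAPPING, so the interpreter running vLLM is still importing a transformers build without Qwen2 support; after upgrading, the notebook kernel has to be restarted before the new version is picked up. A minimal sanity check, assuming the versions the README recommends (transformers >= 4.37.0, vLLM >= 0.3.0):

    # Minimal sketch: confirm the running kernel sees the upgraded packages
    # before constructing the engine again.
    import transformers, vllm
    print("transformers", transformers.__version__)  # needs 4.37.0 or newer for qwen2
    print("vllm", vllm.__version__)                  # README advises vLLM >= 0.3.0

    from vllm import LLM
    llm = LLM(model="Qwen/Qwen1.5-0.5B", trust_remote_code=True, gpu_memory_utilization=0.95)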

TypeError: argument 'tokens': 'NoneType' object cannot be converted to 'PyString'

When deploying Qwen/Qwen1.5-0.5B-Chat-GPTQ-Int4 with vLLM v0.3.0, calling it via cURL as shown in README.md will result in the following error.

If stream=true is used, there won't be an error, but the output will be a series of "!" tokens. It seems like there is an error related to the vocabulary while decoding.

It's worth mentioning that the error only occurs in the 0.5B version; for other GPTQ Int4 versions like 1.8B, 4B, and 7B, this error does NOT occur.

Exception in callback functools.partial(<function _raise_exception_on_finish at 0x7f5e8e5380d0>, request_tracker=<vllm.engine.async_llm_engine.RequestTracker object at 0x7f5e8bdd2e90>)
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7f5e8e5380d0>, request_tracker=<vllm.engine.async_llm_engine.RequestTracker object at 0x7f5e8bdd2e90>)>
Traceback (most recent call last):
  File "/workspace/vllm/engine/async_llm_engine.py", line 29, in _raise_exception_on_finish
    task.result()
  File "/workspace/vllm/engine/async_llm_engine.py", line 409, in run_engine_loop
    has_requests_in_progress = await self.engine_step()
  File "/workspace/vllm/engine/async_llm_engine.py", line 388, in engine_step
    request_outputs = await self.engine.step_async()
  File "/workspace/vllm/engine/async_llm_engine.py", line 203, in step_async
    return self._process_model_outputs(output, scheduler_outputs)
  File "/workspace/vllm/engine/llm_engine.py", line 715, in _process_model_outputs
    self._process_sequence_group_outputs(seq_group, outputs)
  File "/workspace/vllm/engine/llm_engine.py", line 586, in _process_sequence_group_outputs
    self._decode_sequence(seq, seq_group.sampling_params)
  File "/workspace/vllm/engine/llm_engine.py", line 891, in _decode_sequence
    read_offset) = detokenize_incrementally(
  File "/workspace/vllm/transformers_utils/tokenizer.py", line 221, in detokenize_incrementally
    new_text = tokenizer.convert_tokens_to_string(
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_fast.py", line 612, in convert_tokens_to_string
    return self.backend_tokenizer.decoder.decode(tokens)
TypeError: argument 'tokens': 'NoneType' object cannot be converted to 'PyString'

[AttributeError: 'Qwen2Tokenizer' object has no attribute 'eod_id'] Can't finetune Qwen/Qwen1.5-0.5B-Chat

是否已有关于该错误的issue或讨论? | Is there an existing issue / discussion for this?

  • 我已经搜索过已有的issues和讨论 | I have searched the existing issues / discussions

该问题是否在FAQ中有解答? | Is there an existing answer for this in FAQ?

  • 我已经搜索过FAQ | I have searched FAQ

当前行为 | Current Behavior

When you run with model Qwen/Qwen1.5-0.5B-Chat:

finetune/finetune_lora_single_gpu.sh

You get the error:

Traceback (most recent call last):
  File "/Qwentuning/Qwen/finetune.py", line 374, in <module>
    train()
  File "/Qwentuning/Qwen/finetune.py", line 328, in train
    tokenizer.pad_token_id = tokenizer.eod_id
AttributeError: 'Qwen2Tokenizer' object has no attribute 'eod_id'
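eod_id was an attribute of the old Qwen tokenizer; Qwen2Tokenizer exposes the standard eos/pad token fields instead, so the failing line in finetune.py needs the adjustment sketched below (an assumption: nothing else in the script depends on eod_id):

    # Minimal sketch: replace `tokenizer.pad_token_id = tokenizer.eod_id` with the
    # standard fields that Qwen2Tokenizer actually provides.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-0.5B-Chat")
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token_id = tokenizer.eos_token_id
    print(tokenizer.pad_token_id)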

期望行为 | Expected Behavior

Should train the model

复现方法 | Steps To Reproduce

  1. Linux environment
  2. pip install transformers peft deepspeed datasets
  3. Change the script finetune/finetune_lora_single_gpu.sh : model name to Qwen/Qwen1.5-0.5B-Chat
  4. Change path for the data
  5. Run the script with finetune/finetune_lora_single_gpu.sh

运行环境 | Environment

- OS: Linux
- Python: 3.10
- Transformers: 4.37.2
- PyTorch: 2.0
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`): 11.7

备注 | Anything else?

No response

use qwen1.5 with vllm default prompt template?

Hi, I am running qwen1.5 72b with vllm on 4x A100 80G using this command:

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen1.5-72B-Chat --enforce-eager --tensor-parallel-size 4 --gpu-memory-utilization 1.0 --max-model-len 4096

Do I need to specify a template? Currently vllm tells me that it is using

INFO 02-12 14:59:44 serving_chat.py:260] Using default chat template:
INFO 02-12 14:59:44 serving_chat.py:260] {% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system
INFO 02-12 14:59:44 serving_chat.py:260] You are a helpful assistant<|im_end|>
INFO 02-12 14:59:44 serving_chat.py:260] ' }}{% endif %}{{'<|im_start|>' + message['role'] + '
INFO 02-12 14:59:44 serving_chat.py:260] ' + message['content']}}{% if (loop.last and add_generation_prompt) or not loop.last %}{{ '<|im_end|>' + '
INFO 02-12 14:59:44 serving_chat.py:260] '}}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %}{{ '<|im_start|>assistant
INFO 02-12 14:59:44 serving_chat.py:260] ' }}{% endif %}
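The template in the log is a ChatML-style template of the kind Qwen1.5-Chat expects, so an extra --chat-template file should not normally be needed; to be sure, one can render the tokenizer's bundled chat template locally and compare it with what the server reports. A small sketch:

    # Minimal sketch: render the prompt produced by the tokenizer's own chat template
    # and compare it with the template vLLM logs at startup.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-72B-Chat")
    messages = [{"role": "user", "content": "你好"}]
    print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))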

Has Qwen1.5 changed its prompt format or stop tokens compared to the first generation?

I reused the code that previously deployed Qwen-chat-14b to deploy Qwen1.5-chat-14b with vllm, keeping the same request format in my RAG pipeline.
Many responses became very long, some even containing strings like "\n\n\n\n\n" or drifting into irrelevant content.
From experience this usually means either the prompt or the stop word is not configured correctly.
Have the prompt format, stop words, etc. of Qwen1.5 changed compared to the previous generation?

After full fine-tuning with llama-factory and serving with vllm, requests through the API never stop generating

url = "/v1/chat/completions"

data = {
# "model": "/gpdata/ideal/download/llama-factory/qwen1.5-0.5B-Chat",
"model": "qwen",
# "model": "/gpdata/ideal/moneymarket/Qwen-7B-Chat/",
# "model": "/gpdata/ideal/moneymarket/ChatGLM3-6B-32k/",
"messages": openai_bot.messages_base + [
{
"role": "user",
"content": prompt
}
],
"temperature": 0,
"top_p": 1,
"max_length": 100,
"stream": False,
# "stop": ["<|im_start|>", "<|im_end|>"],
# "stop": ["151645","151643"]
"add_generation_prompt": True,
}

Fine-tuning parameters:
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py
--stage sft
--do_train True
--model_name_or_path /gpdata/ideal/download/qwen1.5-0.5B-Chat
--finetuning_type full
--template qwen
--dataset_dir data
--dataset moneymarket
--cutoff_len 1024
--learning_rate 5e-05
--num_train_epochs 3.0
--max_samples 100000
--per_device_train_batch_size 4
--gradient_accumulation_steps 4
--lr_scheduler_type cosine
--max_grad_norm 1.0
--logging_steps 5
--save_steps 100
--warmup_steps 0
--lora_rank 8
--lora_dropout 0.1
--lora_target q_proj,v_proj
--output_dir saves/Qwen1.5-0.5B-Chat/full/train_2024-02-18-16-37-04
--fp16 True
--val_size 0.1
--evaluation_strategy steps
--eval_steps 100
--load_best_model_at_end True
--plot_loss True
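One hedged observation on the request body above: the OpenAI-compatible endpoint reads max_tokens rather than max_length, and generation that never stops usually means the server is not stopping at the ChatML end tag, so re-enabling the commented-out stop field is a reasonable first try. A sketch of the request dict with those two changes (prompt as defined in the reporter's script):

    # Minimal sketch: use max_tokens (not max_length) and stop at the ChatML end tag.
    data = {
        "model": "qwen",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        "top_p": 1,
        "max_tokens": 100,
        "stop": ["<|im_end|>"],
        "stream": False,
    }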

7B-Chat-GPTQ-Int4 "disable_exllma": true

For Qwen1.5-7B-Chat-GPTQ-Int4, "disable_exllma": true has to be added under "quantization_config" (next to "exllama_config") in config.json, otherwise it errors:

{
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151643,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 32768,
  "max_window_layers": 28,
  "model_type": "qwen2",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "quantization_config": {
    "batch_size": 1,
    "bits": 4,
    "block_name_to_quantize": null,
    "cache_block_outputs": true,
    "damp_percent": 0.01,
    "dataset": null,
    "desc_act": false,
    "exllama_config": {
      "version": 1
    },
    "group_size": 128,
    "max_input_length": null,
    "model_seqlen": null,
    "module_name_preceding_first_block": null,
    "modules_in_block_to_quantize": null,
    "pad_token_id": null,
    "quant_method": "gptq",
    "sym": true,
    "tokenizer": null,
    "true_sequential": true,
    "use_cuda_fp16": false,
    "use_exllama": true,
    "disable_exllma": true 
  },
  "rms_norm_eps": 1e-06,
  "rope_theta": 1000000.0,
  "sliding_window": 32768,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.37.0",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 151936
}
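An alternative to hand-editing config.json, as a sketch: transformers' GPTQConfig exposes this switch directly (use_exllama=False in current versions, disable_exllama in older ones), so the ExLlama kernel can be disabled at load time:

    # Minimal sketch: turn off the ExLlama GPTQ kernel via the quantization config
    # instead of editing the checkpoint's config.json.
    from transformers import AutoModelForCausalLM, GPTQConfig

    model = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen1.5-7B-Chat-GPTQ-Int4",
        device_map="auto",
        quantization_config=GPTQConfig(bits=4, use_exllama=False),
    )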

Exllamav2 Backend Support Plz.

Exllama2 currently has the best quantization strategy and the fastest inference speed except tensorrt-llm. In addition, its memory usage during inference is also minimal.

Can this efficient inference backend be supported?

Running Qwen1.5-14B-Chat with vllm raises TypeError: 'type' object is not subscriptable


python3 -m vllm.entrypoints.openai.api_server --model /root/Qwen1.5-14B-Chat
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.8/dist-packages/vllm/entrypoints/openai/api_server.py", line 25, in
from vllm.entrypoints.openai.serving_completion import OpenAIServingCompletion
File "/usr/local/lib/python3.8/dist-packages/vllm/entrypoints/openai/serving_completion.py", line 22, in
TypeTokenIDs = list[int]
TypeError: 'type' object is not subscriptable
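The failing line subscripts the built-in list (TypeTokenIDs = list[int]), which is only valid at runtime on Python 3.9+; the traceback shows Python 3.8, so running vLLM under a 3.9+ interpreter (or a build that uses typing.List) avoids the error. A tiny sketch of the difference:

    # Minimal sketch: built-in generics are subscriptable only from Python 3.9.
    import sys
    print(sys.version_info)      # the traceback above shows /usr/lib/python3.8

    from typing import List
    TypeTokenIDs = List[int]     # 3.8-compatible spelling of list[int]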

Failed to import transformers.models.qwen2.modeling_qwen2

After running the program from (https://github.com/QwenLM/Qwen1.5#-hugging-face-transformers), the following exception occurs:

(qwen1.5) root@test:/home/test/qwen1.5# python chat.py
Traceback (most recent call last):
File "/root/anaconda3/envs/qwen1.5/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 1364, in _get_module
return importlib.import_module("." + module_name, self.__name__)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/qwen1.5/lib/python3.11/importlib/init.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "", line 1204, in _gcd_import
File "", line 1176, in _find_and_load
File "", line 1147, in _find_and_load_unlocked
File "", line 690, in _load_unlocked
File "", line 940, in exec_module
File "", line 241, in _call_with_frames_removed
File "/root/anaconda3/envs/qwen1.5/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 49, in
from flash_attn import flash_attn_func, flash_attn_varlen_func
File "/root/anaconda3/envs/qwen1.5/lib/python3.11/site-packages/flash_attn/init.py", line 3, in
from flash_attn.flash_attn_interface import (
File "/root/anaconda3/envs/qwen1.5/lib/python3.11/site-packages/flash_attn/flash_attn_interface.py", line 10, in
import flash_attn_2_cuda as flash_attn_cuda
ImportError: /root/anaconda3/envs/qwen1.5/lib/python3.11/site-packages/flash_attn_2_cuda.cpython-311-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalINS2_10ScalarTypeEEENS6_INS2_6LayoutEEENS6_INS2_6DeviceEEENS6_IbEE

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/test/qwen1.5/chat.py", line 4, in
model = AutoModelForCausalLM.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/qwen1.5/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 565, in from_pretrained
model_class = _get_model_class(config, cls._model_mapping)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/qwen1.5/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 387, in _get_model_class
supported_models = model_mapping[type(config)]
~~~~~~~~~~~~~^^^^^^^^^^^^^^
File "/root/anaconda3/envs/qwen1.5/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 740, in getitem
return self._load_attr_from_module(model_type, model_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/qwen1.5/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 754, in _load_attr_from_module
return getattribute_from_module(self._modules[module_name], attr)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/qwen1.5/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 698, in getattribute_from_module
if hasattr(module, attr):
^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/qwen1.5/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 1354, in getattr
module = self._get_module(self._class_to_module[name])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/qwen1.5/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 1366, in _get_module
raise RuntimeError(
RuntimeError: Failed to import transformers.models.qwen2.modeling_qwen2 because of the following error (look up to see its traceback):
/root/anaconda3/envs/qwen1.5/lib/python3.11/site-packages/flash_attn_2_cuda.cpython-311-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalINS2_10ScalarTypeEEENS6_INS2_6LayoutEEENS6_INS2_6DeviceEEENS6_IbEE
(qwen1.5) root@test:/home/test/qwen1.5#

(qwen1.5) root@test:/home/test/qwen1.5# pip list
Package Version


accelerate 0.24.1
altair 5.2.0
attrs 23.2.0
blinker 1.7.0
cachetools 5.3.2
certifi 2024.2.2
charset-normalizer 3.3.2
click 8.1.7
einops 0.7.0
filelock 3.13.1
flash-attn 2.5.0
fsspec 2024.2.0
gitdb 4.0.11
GitPython 3.1.41
huggingface-hub 0.20.3
idna 3.6
importlib-metadata 6.11.0
Jinja2 3.1.3
jsonschema 4.21.1
jsonschema-specifications 2023.12.1
markdown-it-py 3.0.0
MarkupSafe 2.1.5
mdurl 0.1.2
mpmath 1.3.0
networkx 3.2.1
ninja 1.11.1.1
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.19.3
nvidia-nvjitlink-cu12 12.3.101
nvidia-nvtx-cu12 12.1.105
packaging 23.2
pandas 2.2.0
Pillow 9.5.0
pip 23.3.1
protobuf 4.25.2
psutil 5.9.8
pyarrow 15.0.0
pydeck 0.8.1b0
Pygments 2.17.2
Pympler 1.0.1
python-dateutil 2.8.2
pytz 2024.1
pytz-deprecation-shim 0.1.0.post0
PyYAML 6.0.1
referencing 0.33.0
regex 2023.12.25
requests 2.31.0
rich 13.7.0
rpds-py 0.17.1
safetensors 0.4.2
setuptools 68.2.2
six 1.16.0
smmap 5.0.1
some-package 0.1
streamlit 1.24.0
sympy 1.12
tenacity 8.2.3
tokenizers 0.15.1
toml 0.10.2
toolz 0.12.1
torch 2.2.0
torchaudio 2.2.0
torchvision 0.17.0
tornado 6.4
tqdm 4.66.1
transformers 4.37.2
transformers-stream-generator 0.0.4
triton 2.2.0
typing_extensions 4.9.0
tzdata 2023.4
tzlocal 4.3.1
urllib3 2.2.0
validators 0.22.0
watchdog 4.0.0
wheel 0.41.2
zipp 3.17.0

(qwen1.5) root@test:/home/test/qwen1.5# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0

Wed Feb 7 14:39:58 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:01:00.0 Off | Off |
| 34% 55C P2 344W / 450W | 698MiB / 24564MiB | 94% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

Tokenizer size and embedding size mismatch

Hi, what is the correct way to get the vocab size from the tokenizer in HF?

I tried the following:
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-7B-Chat-GPTQ-Int8")

tokenizer.vocab_size and len(tokenizer) give 151643 and 151646 whereas the model outputs 151936
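The three numbers measure different things: vocab_size counts the base vocabulary, len(tokenizer) adds the special tokens, and the checkpoint's embedding/output matrices are padded out to a larger vocab_size (151936), with the surplus rows being unused padding. A sketch showing where each value comes from:

    # Minimal sketch: compare the tokenizer-side and checkpoint-side vocabulary sizes.
    from transformers import AutoConfig, AutoTokenizer

    repo = "Qwen/Qwen1.5-7B-Chat-GPTQ-Int8"
    tokenizer = AutoTokenizer.from_pretrained(repo)
    config = AutoConfig.from_pretrained(repo)

    print(tokenizer.vocab_size)  # base vocabulary, without added special tokens (151643)
    print(len(tokenizer))        # base vocabulary plus special tokens (151646)
    print(config.vocab_size)     # embedding/output rows in the checkpoint (151936)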

AttributeError: 'Qwen2ForCausalLM' object has no attribute 'chat'

是否已有关于该错误的issue或讨论? | Is there an existing issue / discussion for this?

  • 我已经搜索过已有的issues和讨论 | I have searched the existing issues / discussions

该问题是否在FAQ中有解答? | Is there an existing answer for this in FAQ?

  • 我已经搜索过FAQ | I have searched FAQ

当前行为 | Current Behavior

When running with openai.py, subsequent API calls raise the error below. How can this be resolved?
raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'Qwen2ForCausalLM' object has no attribute 'chat'

期望行为 | Expected Behavior

Normal invocation.

复现方法 | Steps To Reproduce

Run python openai.py

运行环境 | Environment

- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):

备注 | Anything else?

Please take a look, thanks.

Qwen1.5 API and GPU memory questions

Does the 1.5 release still support the previous openai_api deployment, or is vLLM / SGLang now required? Also, roughly how much GPU memory does the 72B Int4 model need, and what input length can it handle? I previously downloaded the Int8 model and could deploy it on an 80 GB A100, but once the input exceeded one or two hundred tokens it ran out of memory.
