qwenlm / qwen1.5

Qwen1.5 is the improved version of Qwen, the large language model series developed by Qwen team, Alibaba Cloud.

Shell 100.00%

qwen1.5's Issues

Running the API service fails with: Tokenizer class Qwen2Tokenizer does not exist or is not currently imported.

Running the API service with openai_api.py raises an error.
The model is Qwen1.5-14B-Chat-GPTQ-Int8:
python3 openai_api.py --checkpoint-path /root/Qwen1.5-14B-Chat-GPTQ-Int8 --server-port 18000

Traceback (most recent call last):
  File "openai_api.py", line 574, in <module>
    tokenizer = AutoTokenizer.from_pretrained(
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/auto/tokenization_auto.py", line 724, in from_pretrained
    raise ValueError(
ValueError: Tokenizer class Qwen2Tokenizer does not exist or is not currently imported.

The error says Qwen2Tokenizer does not exist. How can I fix this? Thanks.
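
One likely cause of a missing Qwen2Tokenizer is a transformers release that predates the Qwen2 classes. The sketch below is only a version check under that assumption (Qwen2 support starting with transformers 4.37.0, the version recorded in the Qwen1.5 config files). The helper names `parse_version`, `supports_qwen2`, and `installed_transformers_version` are hypothetical.

```python
import re
from importlib import metadata

# Assumption: the Qwen2 model and tokenizer classes first shipped in transformers 4.37.0.
MIN_TRANSFORMERS_FOR_QWEN2 = (4, 37, 0)


def parse_version(version: str) -> tuple:
    """Turn '4.37.2' or '4.38.0.dev0' into a tuple of its first three numeric parts."""
    parts = re.findall(r"\d+", version)[:3]
    parts += ["0"] * (3 - len(parts))
    return tuple(int(part) for part in parts)


def supports_qwen2(version: str) -> bool:
    """Return True when the given transformers version is new enough for Qwen2 classes."""
    return parse_version(version) >= MIN_TRANSFORMERS_FOR_QWEN2


def installed_transformers_version():
    """Return the installed transformers version string, or None if it is not installed."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None
```

If `supports_qwen2(installed_transformers_version())` is false, upgrading transformers in the environment that actually runs openai_api.py is the first thing to try.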

vllm deploy failed

**Deploy platform:** RTX3090-24G
**Model:** Qwen1.5-7B-Chat-GPTQ-Int4
bash:
CUDA_VISIBLE_DEVICES=7 python -m vllm.entrypoints.openai.api_server --model qwen/Qwen1.5-7B-Chat-GPTQ-Int4 --served-model-name qwen/Qwen1.5-7B-Chat-GPTQ-Int4 --quantization gptq

error:
ValueError: The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (24320). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
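
The error message itself names the two knobs. As a minimal sketch, the helper below assembles the same server command with `--max-model-len` set below the 24320 tokens the KV cache can hold. The function name `build_vllm_command` and the value 16384 are illustrative assumptions; the flags are the ones that appear elsewhere on this page.

```python
import shlex


def build_vllm_command(model, max_model_len, gpu_memory_utilization=0.9, quantization=None):
    """Assemble the argv for vLLM's OpenAI-compatible server with an explicit context cap."""
    if max_model_len <= 0:
        raise ValueError("max_model_len must be positive")
    if not 0 < gpu_memory_utilization <= 1:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    cmd = [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]
    if quantization:
        cmd += ["--quantization", quantization]
    return cmd


# 16384 is an arbitrary example value below the 24320-token KV cache limit reported above.
cmd = build_vllm_command("qwen/Qwen1.5-7B-Chat-GPTQ-Int4", max_model_len=16384, quantization="gptq")
print(shlex.join(cmd))
```

Capping the context length trades maximum prompt size for a KV cache that fits in 24 GB; raising `--gpu-memory-utilization` is the other option the message suggests.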

Where are the provided GGUF files?

Does vLLM not support Qwen1.5 yet?

Start Date

No response

Implementation PR

Does vLLM not support Qwen2ForCausalLM yet?

Summary

Does vLLM not support Qwen2ForCausalLM yet?

Basic Example

Does vLLM not support Qwen2ForCausalLM yet?

Drawbacks

Does vLLM not support Qwen2ForCausalLM yet?

Unresolved questions

No response

1.5-72B-Chat uses 270 GB of GPU memory and generates only 3 tokens/s

Is there an existing issue / discussion for this?

I have searched the existing issues / discussions

Is there an existing answer for this in FAQ?

I have searched FAQ

Current Behavior

1.5-72B-Chat uses twice the GPU memory of 72B-Chat and runs at only half its speed. The model config specifies "torch_dtype": "bfloat16", but the dtype printed after loading is float32.
Model: https://modelscope.cn/models/qwen/Qwen1.5-72B-Chat/files

Config file: https://modelscope.cn/models/qwen/Qwen1.5-72B-Chat/file/view/master/config.json?status=1
{
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151643,
  "hidden_act": "silu",
  "hidden_size": 8192,
  "initializer_range": 0.02,
  "intermediate_size": 24576,
  "max_position_embeddings": 32768,
  "max_window_layers": 70,
  "model_type": "qwen2",
  "num_attention_heads": 64,
  "num_hidden_layers": 80,
  "num_key_value_heads": 64,
  "rms_norm_eps": 1e-06,
  "rope_theta": 1000000.0,
  "sliding_window": 32768,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.37.0",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 152064
}
After the model loads, printing the actual parameter dtype:

Expected Behavior

No response

Steps To Reproduce

# chat_demo_72B-1.5.py
import os
import time

import torch
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"
model_path = "/data/shared/Qwen1.5/Qwen1.5-72B-Chat"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
# No torch_dtype is passed, so transformers falls back to its default dtype (float32).
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", use_cache=True)

# Sum the memory currently allocated on every visible GPU.
total_memory_allocated = 0
device_count = torch.cuda.device_count()
for device_id in range(device_count):
    torch.cuda.set_device(device_id)
    total_memory_allocated += torch.cuda.memory_allocated(device_id)
total_memory_allocated_gb = total_memory_allocated / (1024**3)
print(f"\nTotal memory allocated across all visible CUDA devices: {total_memory_allocated_gb} GB\n")

tokenizer = AutoTokenizer.from_pretrained(model_path)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
]

while True:
    user_input = input("\nUser: ")
    if user_input == "quit":
        print("exit")
        break
    messages.append({"role": "user", "content": user_input})
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(device)
    pbar = tqdm(total=1, desc="Generating", leave=True)
    start_time = time.time()
    generated_ids = model.generate(
        model_inputs.input_ids,
        max_new_tokens=512,
        pad_token_id=tokenizer.eos_token_id
    )
    end_time = time.time()
    generation_time = end_time - start_time
    pbar.update(1)
    pbar.close()
    # Keep only the newly generated tokens by dropping the prompt prefix.
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]

    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

    print("\nQwen-Chat: ", response)
    # The reply belongs to the assistant role (the original appended it as "system").
    messages.append({"role": "assistant", "content": response})
    total_generated_tokens = sum(len(ids) for ids in generated_ids)
    total_characters = len(response)
    chars_per_second = total_characters / generation_time
    tokens_per_second = total_generated_tokens / generation_time
    print(f"Response length: {len(response)} characters")
    print(f"Response time: {generation_time:.2f}s")
    print(f"Characters per second: {chars_per_second:.2f}")
    print(f"Tokens per second: {tokens_per_second:.2f}")
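
The reported 270 GB is consistent with the weights being held in float32 rather than bfloat16. The arithmetic below is a rough sketch that assumes about 72e9 parameters and counts weights only (no KV cache or activations); `weight_memory_gib` is a hypothetical helper.

```python
# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}

# Assumption: roughly 72 billion parameters; the exact count differs slightly.
NUM_PARAMS = 72e9


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Memory taken by the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


for dtype in ("float32", "bfloat16"):
    print(f"{dtype}: {weight_memory_gib(NUM_PARAMS, dtype):.0f} GiB")
```

Under these assumptions float32 weights need about 268 GiB and bfloat16 about 134 GiB. Passing `torch_dtype="auto"` to `AutoModelForCausalLM.from_pretrained` asks transformers to use the dtype recorded in config.json instead of its float32 default.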

Environment

- OS:Alibaba Cloud Linux release 3 (Soaring Falcon)
- Python:Python 3.8.10
- Transformers:4.37.2
- PyTorch:2.0.1
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):11.7

Anything else?

No response

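Has anyone managed to deploy qwen1.5 and chat with it normally?

With the API/web deployment via FastChat, both chat and completion end prematurely, as if generation were unfinished. The same happens when calling completion after apply_chat_template.

With vllm.entrypoints.openai.api_server, the model generates many \n characters. I tried the methods from the issues and have not found one that works. Using \n directly as a stop string fixes one problem but causes another. qwen/Qwen1.5-14B-Chat does not show this behavior, qwen/Qwen1.5-14B-Chat-GPTQ-Int4 shows it occasionally, qwen/Qwen1.5-72B-Chat-GPTQ-Int4 shows it severely, and 4B-Chat and its GPTQ-Int4 version show it severely as well.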

messages = [{"role": "user", "content": "你是什么模型？"}]
response = openai.ChatCompletion.create(
    model=model_id,
    messages=messages,
    top_k=50,
    top_p=0.7,
    repetition_penalty=1,
    temperature=0.7,
    max_tokens=512,
    stream=True
)

Bad result in Qwen1.5-0.5B-Chat-GGUF

same setting in transformer

Who knows what GQA and SWA refer to?

These two terms appear in the README.md:

We have not integrated GQA and mixture of SWA and full attention in this version and we will add the features in the future version.

How to enable flash attention 2?

Wonderful work!
Did you train the model with flash attention 2? And how do I enable flash attention 2?
Have you tested whether flash attention 2 degrades model performance?
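
On the "how to enable" part, a minimal sketch: recent transformers releases accept an `attn_implementation` argument in `from_pretrained`. The snippet assumes the flash-attn package is installed, a supported GPU, and half-precision weights; `build_load_kwargs` and `load_model` are hypothetical helper names, and `device_map="auto"` additionally assumes accelerate is installed.

```python
def build_load_kwargs(use_flash_attention_2: bool) -> dict:
    """Keyword arguments for from_pretrained, optionally requesting FlashAttention-2."""
    # torch_dtype="auto" picks the dtype from config.json (bfloat16 or float16 for Qwen1.5),
    # which matters because FlashAttention-2 only supports half-precision dtypes.
    kwargs = {"torch_dtype": "auto", "device_map": "auto"}
    if use_flash_attention_2:
        kwargs["attn_implementation"] = "flash_attention_2"
    return kwargs


def load_model(model_path: str, use_flash_attention_2: bool = True):
    """Load a causal LM; transformers is imported lazily so the helper above stays standalone."""
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(model_path, **build_load_kwargs(use_flash_attention_2))
```

For example, `load_model("Qwen/Qwen1.5-7B-Chat")` would request FlashAttention-2. Whether the released checkpoints were trained with it is a question only the maintainers can answer.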

Multi-GPU DPO training fails with: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.HalfTensor instead (while checking arguments for embedding)

Traceback (most recent call last):
File "Sakura_DPO_qwen.py", line 339, in <module>
fire.Fire(train)
File "/root/miniconda3/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/root/miniconda3/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/root/miniconda3/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "Sakura_DPO_qwen.py", line 329, in train
dpo_trainer.train()
File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 1539, in train
return inner_training_loop(
File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 1869, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 2772, in training_step
loss = self.compute_loss(model, inputs)
File "/root/miniconda3/lib/python3.8/site-packages/trl/trainer/dpo_trainer.py", line 1055, in compute_loss
loss, metrics = self.get_batch_loss_metrics(model, inputs, train_eval="train")
File "/root/miniconda3/lib/python3.8/site-packages/trl/trainer/dpo_trainer.py", line 996, in get_batch_loss_metrics
) = self.concatenated_forward(model, batch)
File "/root/miniconda3/lib/python3.8/site-packages/trl/trainer/dpo_trainer.py", line 960, in concatenated_forward
all_logits = model(
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1852, in forward
loss = self.module(*inputs, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/peft/peft_model.py", line 1083, in forward
return self.base_model(
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/peft/tuners/tuners_utils.py", line 161, in forward
return self.model.forward(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1173, in forward
outputs = self.model(
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1003, in forward
inputs_embeds = self.embed_tokens(input_ids)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
result = forward_call(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 162, in forward
return F.embedding(
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/functional.py", line 2210, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.HalfTensor instead (while checking arguments for embedding)

Are there plans to develop a model of around 30B parameters?

Hi, as the title says. Thanks a lot!

ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.

python -m vllm.entrypoints.openai.api_server --model data/Qwen1.5-14B-Chat --host 0.0.0.0 --port 30005 --dtype half --gpu-memory-utilization 0.95

What causes this error? I am using a V100 with 32 GB of GPU memory.

Is function calling gone? Are im_start_id and im_end_id no longer available either?

When the model is started with vllm.entrypoints.openai.api_server, the API returns incorrect output

vLLM
We advise you to use vLLM>=0.3.0 to build OpenAI-compatible API service. Start the server with a chat model, e.g. Qwen1.5-7B-Chat:

python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen1.5-7B-Chat

Then use the chat API as demonstrated below:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "Qwen/Qwen1.5-7B-Chat",
    "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "你是谁？"}
    ]
    }'

The output ends with many '\n\n\n'.

Adding the parameter "add_generation_prompt": True to the API call resolves the issue.
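
A minimal sketch of the workaround described above, using only the standard library: the same request as the curl example, with "add_generation_prompt" added to the JSON body. The helper name `build_chat_request` and the default base URL are assumptions for illustration; the request is built but not sent.

```python
import json
import urllib.request


def build_chat_request(model, user_content,
                       system_content="You are a helpful assistant.",
                       base_url="http://localhost:8000"):
    """Build (but do not send) a chat completion request that sets add_generation_prompt."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_content},
            {"role": "user", "content": user_content},
        ],
        # Extra field reported above to stop the trailing newlines.
        "add_generation_prompt": True,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request("Qwen/Qwen1.5-7B-Chat", "Who are you?")
```

With a server running, `urllib.request.urlopen(request)` sends it and the response body can be parsed with `json.load`.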

What is the maximum number of tokens supported when fine-tuning the model?

As the title says, thanks.

AttributeError: 'Qwen2ForCausalLM' object has no attribute 'chat_stream'

Is there an existing issue / discussion for this?

I have searched the existing issues / discussions

Is there an existing answer for this in FAQ?

I have searched FAQ

Current Behavior

Tried to chat with the Qwen1.5-7B-Chat-GPTQ-Int8 model and hit the following:
(Welcome to the Qwen-Chat model. Type text to start chatting, or :h for command help.)

Note: This demo is governed by the original license of Qwen.
We strongly advise users not to knowingly generate or allow others to knowingly generate harmful content, including hate speech, violence, pornography, deception, etc.

User> Can you introduce yourself?
Traceback (most recent call last):
  File "D:\work\AI\HK_Integration\Qwen\cli_demo.py", line 210, in <module>
    main()
  File "D:\work\AI\HK_Integration\Qwen\cli_demo.py", line 198, in main
    for response in model.chat_stream(tokenizer, query, history=history, generation_config=config):
  File "D:\software\Anaconda3\envs\HK_Integration_env\lib\site-packages\torch\nn\modules\module.py", line 1614, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'Qwen2ForCausalLM' object has no attribute 'chat_stream'
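
The traceback shows the old cli_demo.py calling a method that the transformers-native Qwen2ForCausalLM class does not define. One possible replacement, sketched here under the assumption that streaming through transformers' `TextIteratorStreamer` is acceptable, is a small generator; `stream_chat` is a hypothetical name, not a Qwen API.

```python
from threading import Thread


def stream_chat(model, tokenizer, messages, max_new_tokens=512):
    """Yield pieces of the reply as they are generated (hypothetical chat_stream stand-in)."""
    # Imported lazily so that defining this function does not require transformers.
    from transformers import TextIteratorStreamer

    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    generation_kwargs = dict(model_inputs, streamer=streamer, max_new_tokens=max_new_tokens)
    # generate() blocks, so it runs in a worker thread while this generator drains the streamer.
    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()
    for new_text in streamer:
        yield new_text
    thread.join()
```

A caller would loop over it, for example `for piece in stream_chat(model, tokenizer, messages): print(piece, end="", flush=True)`.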

Expected Behavior

Steps To Reproduce

With all dependencies installed, run cli_demo.py

Environment

- OS:windows10
- Python:3.8
- Transformers:4.37.0
- PyTorch:2.0.1
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):11.8

Anything else?

Following the official quick start: the model loads normally, but generation raises an error

WARNING:root:Some parameters are on the meta device device because they were offloaded to the cpu.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:151645 for open-end generation.

With auto_gptq 4-bit quantization, what is the minimum GPU memory required for each model?

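Do the chat models no longer provide a chat method?

AttributeError: 'Qwen2ForCausalLM' object has no attribute 'chat'

I used the chat method before for convenience and for compatibility with other models, but qwen1.5 reports that there is no chat method anymore?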
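
For code that relied on the old call shape, a thin wrapper over `apply_chat_template` and `generate` can play a similar role. This is a sketch only: `build_messages` and `chat` are hypothetical helpers, and the history format (a list of (query, response) pairs) is an assumption carried over from the earlier Qwen chat interface.

```python
def build_messages(query, history=None, system="You are a helpful assistant."):
    """Convert (query, response) history pairs plus a new query into a chat message list."""
    messages = [{"role": "system", "content": system}]
    for user_turn, assistant_turn in history or []:
        messages.append({"role": "user", "content": user_turn})
        messages.append({"role": "assistant", "content": assistant_turn})
    messages.append({"role": "user", "content": query})
    return messages


def chat(model, tokenizer, query, history=None, max_new_tokens=512):
    """Hypothetical stand-in for the removed model.chat(); returns (response, new_history)."""
    history = list(history or [])
    text = tokenizer.apply_chat_template(
        build_messages(query, history), tokenize=False, add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
    # Unpacking passes both input_ids and attention_mask to generate().
    generated_ids = model.generate(**model_inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the newly generated reply is decoded.
    new_tokens = generated_ids[0][model_inputs.input_ids.shape[1]:]
    response = tokenizer.decode(new_tokens, skip_special_tokens=True)
    history.append((query, response))
    return response, history
```

Usage would look like `response, history = chat(model, tokenizer, "Hello")`, followed by later calls that pass `history` back in.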

What's the difference between <|im_end|> and <|endoftext|>?

Please delete this issue

As the title says.

Stuck when generating long text (hangs at the generated_ids step)

When I input a long prompt to the model, it gets stuck in the generation step. GPU utilization rises briefly and then falls back, but the memory occupied by the tokens is not released, and the process stays stuck there. Judging from the code, it seems to hang at this step:
generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
This problem occurs when the text length reaches about 400~500 tokens. Short texts (under 200 tokens) generate replies normally. I reproduced this problem on both the official AWQ quantized model and my own fine-tuned model.

The vLLM example does not work; it returns {"object":"error","message":"The model `Qwen/Qwen1.5-7B-Chat` does not exist.","type":"NotFoundError","param":null,"code":404}

Running the example code prints the following messages. How should I handle them?

2024-02-07 06:36:08,182 - modelscope - INFO - PyTorch version 1.12.1 Found.
2024-02-07 06:36:08,183 - modelscope - INFO - Loading ast index from /root/.cache/modelscope/ast_indexer
2024-02-07 06:36:08,221 - modelscope - INFO - Loading done! Current index file version is 1.12.0, with md5 155281490ae47dfe7d8f4ba91b079bfc and a total number of 964 components indexed
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:151645 for open-end generation.

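Qwen1.5-7B-Chat best-practice learning tutorial

In our tests Qwen1.5 really does perform very well, hhh. We quickly put together a hands-on learning tutorial for Qwen1.5, so everyone can happily grind away over the Spring Festival!

Project: https://github.com/datawhalechina/self-llm.git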

Please update the Docker image as well, thanks

KeyError: 'qwen2' Successfully installed transformers-4.37.2 vllm

Using a ModelScope Jupyter notebook:
!pip install modelscope
import os
os.environ['VLLM_USE_MODELSCOPE'] = 'True'
!pip install vllm
!pip install --upgrade transformers==4.37.2

  ms-swift 1.5.1 requires transformers<4.37,>=4.33, but you have transformers 4.37.2 which is incompatible.
  Successfully installed transformers-4.37.2

from vllm import LLM
llm = LLM(model="Qwen/Qwen1.5-0.5B",trust_remote_code=True,gpu_memory_utilization=0.95) # Create an LLM.
prompts='叶文杰在三体里的角色'
outputs = llm.generate(prompts)

The Qwen model was downloaded from ModelScope, then the following error appeared:

KeyError Traceback (most recent call last)
Cell In[21], line 2
1 from vllm import LLM
----> 2 llm = LLM(model="Qwen/Qwen1.5-0.5B",trust_remote_code=True,gpu_memory_utilization=0.95) # Create an LLM.
3 prompts='叶文杰在三体里的角色'
4 outputs = llm.generate(prompts) # Generate texts from the prompts.

File /opt/conda/lib/python3.10/site-packages/vllm/entrypoints/llm.py:105, in LLM.__init__(self, model, tokenizer, tokenizer_mode, trust_remote_code, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, enforce_eager, max_context_len_to_capture, **kwargs)
87 kwargs["disable_log_stats"] = True
88 engine_args = EngineArgs(
89 model=model,
90 tokenizer=tokenizer,
(...)
103 **kwargs,
104 )
--> 105 self.llm_engine = LLMEngine.from_engine_args(engine_args)
106 self.request_counter = Counter()

File /opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py:304, in LLMEngine.from_engine_args(cls, engine_args)
302 """Creates an LLM engine from the engine arguments."""
303 # Create the engine configs.
--> 304 engine_configs = engine_args.create_engine_configs()
305 parallel_config = engine_configs[2]
306 # Initialize the cluster.

File /opt/conda/lib/python3.10/site-packages/vllm/engine/arg_utils.py:218, in EngineArgs.create_engine_configs(self)
215 def create_engine_configs(
216 self,
217 ) -> Tuple[ModelConfig, CacheConfig, ParallelConfig, SchedulerConfig]:
--> 218 model_config = ModelConfig(self.model, self.tokenizer,
219 self.tokenizer_mode, self.trust_remote_code,
220 self.download_dir, self.load_format,
221 self.dtype, self.seed, self.revision,
222 self.tokenizer_revision, self.max_model_len,
223 self.quantization, self.enforce_eager,
224 self.max_context_len_to_capture)
225 cache_config = CacheConfig(self.block_size,
226 self.gpu_memory_utilization,
227 self.swap_space,
228 model_config.get_sliding_window())
229 parallel_config = ParallelConfig(self.pipeline_parallel_size,
230 self.tensor_parallel_size,
231 self.worker_use_ray,
232 self.max_parallel_loading_workers)

File /opt/conda/lib/python3.10/site-packages/vllm/config.py:101, in ModelConfig.__init__(self, model, tokenizer, tokenizer_mode, trust_remote_code, download_dir, load_format, dtype, seed, revision, tokenizer_revision, max_model_len, quantization, enforce_eager, max_context_len_to_capture)
98 self.download_dir = model_path
99 self.tokenizer = model_path
--> 101 self.hf_config = get_config(self.model, trust_remote_code, revision)
102 self.dtype = _get_and_verify_dtype(self.hf_config, dtype)
103 self.max_model_len = _get_and_verify_max_len(self.hf_config,
104 max_model_len)

File /opt/conda/lib/python3.10/site-packages/vllm/transformers_utils/config.py:23, in get_config(model, trust_remote_code, revision)
19 def get_config(model: str,
20 trust_remote_code: bool,
21 revision: Optional[str] = None) -> PretrainedConfig:
22 try:
---> 23 config = AutoConfig.from_pretrained(
24 model, trust_remote_code=trust_remote_code, revision=revision)
25 except ValueError as e:
26 if (not trust_remote_code and
27 "requires you to execute the configuration file" in str(e)):

File /opt/conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py:1098, in AutoConfig.from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
1096 return config_class.from_pretrained(pretrained_model_name_or_path, **kwargs)
1097 elif "model_type" in config_dict:
-> 1098 config_class = CONFIG_MAPPING[config_dict["model_type"]]
1099 return config_class.from_dict(config_dict, **unused_kwargs)
1100 else:
1101 # Fallback: use pattern matching on the string.
1102 # We go from longer names to shorter names to catch roberta before bert (for instance)

File /opt/conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py:795, in _LazyConfigMapping.__getitem__(self, key)
793 return self._extra_content[key]
794 if key not in self._mapping:
--> 795 raise KeyError(key)
796 value = self._mapping[key]
797 module_name = model_type_to_module_name(key)

KeyError: 'qwen2'

TypeError: argument 'tokens': 'NoneType' object cannot be converted to 'PyString'

When deploying Qwen/Qwen1.5-0.5B-Chat-GPTQ-Int4 with vLLM v0.3.0, calling it via cURL as shown in README.md will result in the following error.

If stream=true is used, there won't be an error, but the output will be a series of "!" tokens. It seems like there is an error related to the vocabulary while decoding.

It's worth mentioning that the error only occurs in the 0.5B version; for other GPTQ Int4 versions like 1.8B, 4B, and 7B, this error does NOT occur.

Exception in callback functools.partial(<function _raise_exception_on_finish at 0x7f5e8e5380d0>, request_tracker=<vllm.engine.async_llm_engine.RequestTracker object at 0x7f5e8bdd2e90>)
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7f5e8e5380d0>, request_tracker=<vllm.engine.async_llm_engine.RequestTracker object at 0x7f5e8bdd2e90>)>
Traceback (most recent call last):
  File "/workspace/vllm/engine/async_llm_engine.py", line 29, in _raise_exception_on_finish
    task.result()
  File "/workspace/vllm/engine/async_llm_engine.py", line 409, in run_engine_loop
    has_requests_in_progress = await self.engine_step()
  File "/workspace/vllm/engine/async_llm_engine.py", line 388, in engine_step
    request_outputs = await self.engine.step_async()
  File "/workspace/vllm/engine/async_llm_engine.py", line 203, in step_async
    return self._process_model_outputs(output, scheduler_outputs)
  File "/workspace/vllm/engine/llm_engine.py", line 715, in _process_model_outputs
    self._process_sequence_group_outputs(seq_group, outputs)
  File "/workspace/vllm/engine/llm_engine.py", line 586, in _process_sequence_group_outputs
    self._decode_sequence(seq, seq_group.sampling_params)
  File "/workspace/vllm/engine/llm_engine.py", line 891, in _decode_sequence
    read_offset) = detokenize_incrementally(
  File "/workspace/vllm/transformers_utils/tokenizer.py", line 221, in detokenize_incrementally
    new_text = tokenizer.convert_tokens_to_string(
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_fast.py", line 612, in convert_tokens_to_string
    return self.backend_tokenizer.decoder.decode(tokens)
TypeError: argument 'tokens': 'NoneType' object cannot be converted to 'PyString'

Should the tokenizer be loaded with AutoTokenizer or Qwen2TokenizerFast?

As the title says.

[AttributeError: 'Qwen2Tokenizer' object has no attribute 'eod_id'] Can't finetune Qwen/Qwen1.5-0.5B-Chat

Is there an existing issue / discussion for this?

I have searched the existing issues / discussions

Is there an existing answer for this in FAQ?

I have searched FAQ

Current Behavior

When you run with model Qwen/Qwen1.5-0.5B-Chat:

finetune/finetune_lora_single_gpu.sh

You get the error:

Traceback (most recent call last):
  File "/Qwentuning/Qwen/finetune.py", line 374, in <module>
    train()
  File "/Qwentuning/Qwen/finetune.py", line 328, in train
    tokenizer.pad_token_id = tokenizer.eod_id
AttributeError: 'Qwen2Tokenizer' object has no attribute 'eod_id'
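
The failing line reads `eod_id`, an attribute of the earlier custom Qwen tokenizer that the traceback shows `Qwen2Tokenizer` does not have. A defensive way to pick a padding id that works with either tokenizer is sketched below; `resolve_pad_token_id` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for real tokenizers, using token ids that appear elsewhere on this page.

```python
from types import SimpleNamespace


def resolve_pad_token_id(tokenizer):
    """Pick a padding id without assuming a particular tokenizer class."""
    pad_id = getattr(tokenizer, "pad_token_id", None)
    if pad_id is not None:
        return pad_id
    # Older Qwen tokenizers exposed the end-of-document id as eod_id.
    eod_id = getattr(tokenizer, "eod_id", None)
    if eod_id is not None:
        return eod_id
    # Last resort: reuse the end-of-sequence token for padding.
    return tokenizer.eos_token_id


# Stand-in objects for illustration only; real code would pass the loaded tokenizer.
new_style = SimpleNamespace(pad_token_id=151643, eos_token_id=151645)
old_style = SimpleNamespace(pad_token_id=None, eod_id=151643, eos_token_id=None)
print(resolve_pad_token_id(new_style), resolve_pad_token_id(old_style))
```

In finetune.py the assignment would then become `tokenizer.pad_token_id = resolve_pad_token_id(tokenizer)`. Other parts of that script may make similar assumptions about the old tokenizer.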

Expected Behavior

Should train the model

Steps To Reproduce

Linux environment
pip install transformers peft deepspeed datasets
Change the script finetune/finetune_lora_single_gpu.sh : model name to Qwen/Qwen1.5-0.5B-Chat
Change path for the data
Run the script with finetune/finetune_lora_single_gpu.sh

Environment

- OS: Linux
- Python: 3.10
- Transformers: 4.37.2
- PyTorch: 2.0
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`): 11.7

Anything else?

No response

Model loads onto the GPU very slowly

The model loads much more slowly with the new code than with the earlier custom modeling code.

Made an example project that trains qwen1.5 with padding-free training and the Multipack Sampler

This is an example project that tests padding-free training and the Multipack Sampler from openchat, achieving a 3~10x speedup over conventional padded training.

https://github.com/Sanster/padding_free_llm_train

Use qwen1.5 with the vLLM default prompt template?

Hi, I am running qwen1.5 72b with vLLM on 4 A100 80G GPUs using this command:

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen1.5-72B-Chat --enforce-eager --tensor-parallel-size 4 --gpu-memory-utilization 1.0 --max-model-len 4096

Do I need to specify a template? Currently vllm tells me that it is using

INFO 02-12 14:59:44 serving_chat.py:260] Using default chat template:
INFO 02-12 14:59:44 serving_chat.py:260] {% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system
INFO 02-12 14:59:44 serving_chat.py:260] You are a helpful assistant<|im_end|>
INFO 02-12 14:59:44 serving_chat.py:260] ' }}{% endif %}{{'<|im_start|>' + message['role'] + '
INFO 02-12 14:59:44 serving_chat.py:260] ' + message['content']}}{% if (loop.last and add_generation_prompt) or not loop.last %}{{ '<|im_end|>' + '
INFO 02-12 14:59:44 serving_chat.py:260] '}}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %}{{ '<|im_start|>assistant
INFO 02-12 14:59:44 serving_chat.py:260] ' }}{% endif %}
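
To make the logged template easier to read, here is a plain-Python sketch that mirrors its logic step by step. It is an illustration of what the Jinja template above renders, not code from vLLM or Qwen; `render_chatml` and `DEFAULT_SYSTEM` are hypothetical names.

```python
# Default system prompt text as it appears in the logged template.
DEFAULT_SYSTEM = "You are a helpful assistant"


def render_chatml(messages, add_generation_prompt=True):
    """Render messages the way the logged Jinja chat template does."""
    parts = []
    # A default system turn is injected when the conversation does not start with one.
    if messages and messages[0]["role"] != "system":
        parts.append(f"<|im_start|>system\n{DEFAULT_SYSTEM}<|im_end|>\n")
    last_index = len(messages) - 1
    for index, message in enumerate(messages):
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}")
        # Every turn is closed, except the last one when no generation prompt is requested.
        if index != last_index or add_generation_prompt:
            parts.append("<|im_end|>\n")
    # Open an assistant turn so the model continues from there.
    if add_generation_prompt and messages and messages[-1]["role"] != "assistant":
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


print(render_chatml([{"role": "user", "content": "hi"}]))
```

Note the branch where `add_generation_prompt` is false: the final turn is left without a closing `<|im_end|>`, which may be related to the run-on outputs other issues on this page describe.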

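Question: how does the quantized 72B model compare with the 7B/14B models?

A question: compared with the unquantized 7B/14B models, how well does the 72B model perform after quantization? Is the quantized 72B much worse, or are they simply not comparable?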

Tested qwen1.5-14B-chat inference on Ascend 910A: it is slow and the output is meaningless garbled characters

The tokenizer should be compatible with 1.0, right?

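Have the prompt format and stop tokens changed in Qwen1.5 compared with the first generation?

I reused my earlier Qwen-chat-14b deployment code to deploy Qwen1.5-chat-14b with vLLM, and kept the same request format in my RAG pipeline.
Many answers became very long, some contained content like "\n\n\n\n\n", and some included loosely related text.
From experience, either the prompt or the stop words are not configured correctly.
Have the prompt, stop words, and so on changed in Qwen1.5 compared with the previous generation?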

After full fine-tuning with llama-factory and serving with vLLM, generation does not stop when called through the API

url = "/v1/chat/completions"

data = {
    # "model": "/gpdata/ideal/download/llama-factory/qwen1.5-0.5B-Chat",
    "model": "qwen",
    # "model": "/gpdata/ideal/moneymarket/Qwen-7B-Chat/",
    # "model": "/gpdata/ideal/moneymarket/ChatGLM3-6B-32k/",
    "messages": openai_bot.messages_base + [
        {
            "role": "user",
            "content": prompt
        }
    ],
    "temperature": 0,
    "top_p": 1,
    "max_length": 100,
    "stream": False,
    # "stop": ["<|im_start|>", "<|im_end|>"],
    # "stop": ["151645","151643"]
    "add_generation_prompt": True,
}

Fine-tuning parameters:
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --do_train True \
    --model_name_or_path /gpdata/ideal/download/qwen1.5-0.5B-Chat \
    --finetuning_type full \
    --template qwen \
    --dataset_dir data \
    --dataset moneymarket \
    --cutoff_len 1024 \
    --learning_rate 5e-05 \
    --num_train_epochs 3.0 \
    --max_samples 100000 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 5 \
    --save_steps 100 \
    --warmup_steps 0 \
    --lora_rank 8 \
    --lora_dropout 0.1 \
    --lora_target q_proj,v_proj \
    --output_dir saves/Qwen1.5-0.5B-Chat/full/train_2024-02-18-16-37-04 \
    --fp16 True \
    --val_size 0.1 \
    --evaluation_strategy steps \
    --eval_steps 100 \
    --load_best_model_at_end True \
    --plot_loss True
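
In the request payload above, the commented-out `"stop": ["151645","151643"]` passes token ids as strings, which a server would match as literal text rather than as tokens. A hedged sketch of an alternative follows: it assumes the vLLM server accepts a `stop_token_ids` field and the OpenAI-style `max_tokens` field (rather than `max_length`); verify both against the vLLM version in use. `build_payload` is a hypothetical helper.

```python
# Token ids of the Qwen1.5 vocabulary, as seen in the logs and configs on this page.
IM_END_ID = 151645      # <|im_end|>
ENDOFTEXT_ID = 151643   # <|endoftext|>


def build_payload(model, messages, max_tokens=100):
    """Chat completion body that asks the server to stop on the end-of-turn token ids."""
    return {
        "model": model,
        "messages": messages,
        "temperature": 0,
        "top_p": 1,
        "max_tokens": max_tokens,  # assumption: the OpenAI-style name for the length limit
        "stream": False,
        "stop_token_ids": [IM_END_ID, ENDOFTEXT_ID],  # assumption: vLLM-specific extra field
        "add_generation_prompt": True,
    }


payload = build_payload("qwen", [{"role": "user", "content": "hello"}])
```

If generation still runs on, it is worth checking that the fine-tuned checkpoint's generation config and tokenizer still list `<|im_end|>` as an end-of-sequence token.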

7B-Chat-GPTQ-Int4 "disable_exllma": true

Qwen1.5-7B-Chat-GPTQ-Int4 only stops raising an error after "disable_exllma": true is added to "exllama_config" under "quantization_config" in config.json:

{
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151643,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 32768,
  "max_window_layers": 28,
  "model_type": "qwen2",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "quantization_config": {
    "batch_size": 1,
    "bits": 4,
    "block_name_to_quantize": null,
    "cache_block_outputs": true,
    "damp_percent": 0.01,
    "dataset": null,
    "desc_act": false,
    "exllama_config": {
      "version": 1
    },
    "group_size": 128,
    "max_input_length": null,
    "model_seqlen": null,
    "module_name_preceding_first_block": null,
    "modules_in_block_to_quantize": null,
    "pad_token_id": null,
    "quant_method": "gptq",
    "sym": true,
    "tokenizer": null,
    "true_sequential": true,
    "use_cuda_fp16": false,
    "use_exllama": true,
    "disable_exllma": true 
  },
  "rms_norm_eps": 1e-06,
  "rope_theta": 1000000.0,
  "sliding_window": 32768,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.37.0",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 151936
}

Exllamav2 Backend Support Plz.

Exllama2 currently has the best quantization strategy and the fastest inference speed except tensorrt-llm. In addition, its memory usage during inference is also minimal.

Can this efficient inference backend be supported?

vLLM fails to run Qwen1.5-14B-Chat with TypeError: 'type' object is not subscriptable

python3 -m vllm.entrypoints.openai.api_server --model /root/Qwen1.5-14B-Chat
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/vllm/entrypoints/openai/api_server.py", line 25, in <module>
    from vllm.entrypoints.openai.serving_completion import OpenAIServingCompletion
  File "/usr/local/lib/python3.8/dist-packages/vllm/entrypoints/openai/serving_completion.py", line 22, in <module>
    TypeTokenIDs = list[int]
TypeError: 'type' object is not subscriptable
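
The traceback runs under Python 3.8, where built-in types such as `list` cannot be subscripted; that syntax arrived in Python 3.9 (PEP 585). The sketch below shows the portable spelling of the alias and a small check; `supports_builtin_generics` is a hypothetical helper, and the practical fix is probably to run vLLM on Python 3.9 or newer rather than to patch its source.

```python
import sys
from typing import List

# typing.List works on Python 3.8 as well as on newer versions.
TypeTokenIDs = List[int]


def supports_builtin_generics(version_info=sys.version_info) -> bool:
    """True when expressions like list[int] are valid at runtime (Python 3.9+, PEP 585)."""
    return tuple(version_info[:2]) >= (3, 9)


print("list[int] usable:", supports_builtin_generics())
```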

qwen:0.5b with ollama: abnormal replies, stuck in a loop

Copy of issue ollama/ollama#2405

Definitely worth 10k stars!!!

Failed to import transformers.models.qwen2.modeling_qwen2

After running the program from https://github.com/QwenLM/Qwen1.5#-hugging-face-transformers, an exception occurs:

(qwen1.5) root@test:/home/test/qwen1.5# python chat.py
Traceback (most recent call last):
File "/root/anaconda3/envs/qwen1.5/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 1364, in _get_module
return importlib.import_module("." + module_name, self.name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/qwen1.5/lib/python3.11/importlib/init.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "", line 1204, in _gcd_import
File "", line 1176, in _find_and_load
File "", line 1147, in _find_and_load_unlocked
File "", line 690, in _load_unlocked
File "", line 940, in exec_module
File "", line 241, in _call_with_frames_removed
File "/root/anaconda3/envs/qwen1.5/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 49, in
from flash_attn import flash_attn_func, flash_attn_varlen_func
File "/root/anaconda3/envs/qwen1.5/lib/python3.11/site-packages/flash_attn/__init__.py", line 3, in
from flash_attn.flash_attn_interface import (
File "/root/anaconda3/envs/qwen1.5/lib/python3.11/site-packages/flash_attn/flash_attn_interface.py", line 10, in
import flash_attn_2_cuda as flash_attn_cuda
ImportError: /root/anaconda3/envs/qwen1.5/lib/python3.11/site-packages/flash_attn_2_cuda.cpython-311-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalINS2_10ScalarTypeEEENS6_INS2_6LayoutEEENS6_INS2_6DeviceEEENS6_IbEE

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/test/qwen1.5/chat.py", line 4, in
model = AutoModelForCausalLM.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/qwen1.5/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 565, in from_pretrained
model_class = _get_model_class(config, cls._model_mapping)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/qwen1.5/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 387, in _get_model_class
supported_models = model_mapping[type(config)]
~~~~~~~~~~~~~^^^^^^^^^^^^^^
File "/root/anaconda3/envs/qwen1.5/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 740, in __getitem__
return self._load_attr_from_module(model_type, model_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/qwen1.5/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 754, in _load_attr_from_module
return getattribute_from_module(self._modules[module_name], attr)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/qwen1.5/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 698, in getattribute_from_module
if hasattr(module, attr):
^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/qwen1.5/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 1354, in __getattr__
module = self._get_module(self._class_to_module[name])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/qwen1.5/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 1366, in _get_module
raise RuntimeError(
RuntimeError: Failed to import transformers.models.qwen2.modeling_qwen2 because of the following error (look up to see its traceback):
/root/anaconda3/envs/qwen1.5/lib/python3.11/site-packages/flash_attn_2_cuda.cpython-311-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalINS2_10ScalarTypeEEENS6_INS2_6LayoutEEENS6_INS2_6DeviceEEENS6_IbEE
(qwen1.5) root@test:/home/test/qwen1.5#

(qwen1.5) root@test:/home/test/qwen1.5# pip list
Package Version

accelerate 0.24.1
altair 5.2.0
attrs 23.2.0
blinker 1.7.0
cachetools 5.3.2
certifi 2024.2.2
charset-normalizer 3.3.2
click 8.1.7
einops 0.7.0
filelock 3.13.1
flash-attn 2.5.0
fsspec 2024.2.0
gitdb 4.0.11
GitPython 3.1.41
huggingface-hub 0.20.3
idna 3.6
importlib-metadata 6.11.0
Jinja2 3.1.3
jsonschema 4.21.1
jsonschema-specifications 2023.12.1
markdown-it-py 3.0.0
MarkupSafe 2.1.5
mdurl 0.1.2
mpmath 1.3.0
networkx 3.2.1
ninja 1.11.1.1
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.19.3
nvidia-nvjitlink-cu12 12.3.101
nvidia-nvtx-cu12 12.1.105
packaging 23.2
pandas 2.2.0
Pillow 9.5.0
pip 23.3.1
protobuf 4.25.2
psutil 5.9.8
pyarrow 15.0.0
pydeck 0.8.1b0
Pygments 2.17.2
Pympler 1.0.1
python-dateutil 2.8.2
pytz 2024.1
pytz-deprecation-shim 0.1.0.post0
PyYAML 6.0.1
referencing 0.33.0
regex 2023.12.25
requests 2.31.0
rich 13.7.0
rpds-py 0.17.1
safetensors 0.4.2
setuptools 68.2.2
six 1.16.0
smmap 5.0.1
some-package 0.1
streamlit 1.24.0
sympy 1.12
tenacity 8.2.3
tokenizers 0.15.1
toml 0.10.2
toolz 0.12.1
torch 2.2.0
torchaudio 2.2.0
torchvision 0.17.0
tornado 6.4
tqdm 4.66.1
transformers 4.37.2
transformers-stream-generator 0.0.4
triton 2.2.0
typing_extensions 4.9.0
tzdata 2023.4
tzlocal 4.3.1
urllib3 2.2.0
validators 0.22.0
watchdog 4.0.0
wheel 0.41.2
zipp 3.17.0

(qwen1.5) root@test:/home/test/qwen1.5# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
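The `undefined symbol` in the traceback above is raised while loading the compiled `flash_attn_2_cuda` extension, not by Qwen code. A common cause of this kind of error is a flash-attn binary built against a different PyTorch build than the one installed (torch 2.2.0 here), though that diagnosis is an assumption rather than something the log proves. The sketch below isolates the failing import so it can be checked on its own; `check_import` is a hypothetical helper, not part of any library.

```python
import importlib


def check_import(module_name: str):
    """Try to import a module and return (ok, detail) instead of raising."""
    try:
        importlib.import_module(module_name)
        return True, "ok"
    except Exception as exc:  # ImportError covers missing modules and undefined symbols
        return False, f"{type(exc).__name__}: {exc}"


ok, detail = check_import("flash_attn")
if not ok:
    # If this reports an undefined symbol, reinstalling flash-attn against the
    # current torch build, or uninstalling it, are the usual things to try.
    print("flash_attn is not importable:", detail)
```

If the check fails with the same symbol error, the problem is reproducible without loading any model, which narrows it to the flash-attn installation.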

Tokenizer size and embedding size mismatch

Hi, what is the correct way to get the vocab size from the tokenizer in HF?

I tried the following:
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-7B-Chat-GPTQ-Int8")

tokenizer.vocab_size and len(tokenizer) give 151643 and 151646 respectively, whereas the model's output dimension is 151936.
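In Hugging Face tokenizers, `vocab_size` counts only the base vocabulary, while `len(tokenizer)` also includes added tokens, which accounts for the difference of 3. The model's embedding matrix being larger than both is, as far as can be told from the numbers alone, extra unused rows (likely padding); that interpretation is an assumption. The sketch below only does the arithmetic on the three figures reported above. `summarize_vocab_sizes` is a hypothetical helper.

```python
def summarize_vocab_sizes(base_vocab: int, full_len: int, embedding_rows: int) -> dict:
    """Compare tokenizer sizes with the number of rows in the embedding matrix."""
    return {
        # tokens added on top of the base vocabulary (e.g. special tokens)
        "added_tokens": full_len - base_vocab,
        # embedding rows that no tokenizer id maps to
        "unused_embedding_rows": embedding_rows - full_len,
        # every token id must index a valid embedding row
        "ids_fit": full_len <= embedding_rows,
    }


report = summarize_vocab_sizes(151643, 151646, 151936)
print(report)
```

When sizing an output layer or allocating logits, the value from the model config (`vocab_size` of 151936 in the config shown earlier) is the one that matches the model's tensors.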

💡 Fine-tuning of Qwen1.5 models is supported by the SWIFT framework from the ModelScope community.

Start Date

No response

Implementation PR

No response

Summary

SWIFT is a lightweight training framework from the ModelScope community. You can visit our GitHub repository:

https://github.com/modelscope/swift

Basic Example

Drawbacks

Unresolved questions

No response

During generate, this warning appears: The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results. Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation. What does this mean, and what effect does it have?
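The warning above is emitted when `generate` receives only `input_ids`. Passing the tokenizer's `attention_mask` and an explicit `pad_token_id` is the usual way to silence it; this matters mostly for batched, padded inputs, where padding positions would otherwise be attended to. The sketch below builds the keyword arguments from a tokenizer output; `build_generate_kwargs` is a hypothetical helper, and falling back to the EOS id for padding mirrors what the warning says the library does by default.

```python
def build_generate_kwargs(encoded, eos_token_id, pad_token_id=None, max_new_tokens=64):
    """Collect explicit arguments for model.generate from a tokenizer output."""
    return {
        "input_ids": encoded["input_ids"],
        # marks real tokens (1) versus padding (0)
        "attention_mask": encoded["attention_mask"],
        # fall back to EOS when the tokenizer defines no pad token
        "pad_token_id": pad_token_id if pad_token_id is not None else eos_token_id,
        "max_new_tokens": max_new_tokens,
    }


# Intended use with a loaded model and tokenizer (not executed here):
#   encoded = tokenizer(prompt, return_tensors="pt")
#   kwargs = build_generate_kwargs(encoded, tokenizer.eos_token_id, tokenizer.pad_token_id)
#   output_ids = model.generate(**kwargs)
example = build_generate_kwargs(
    {"input_ids": [[1, 2, 3]], "attention_mask": [[1, 1, 1]]}, eos_token_id=151643
)
print(example["pad_token_id"])
```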

AttributeError: 'Qwen2ForCausalLM' object has no attribute 'chat'

Is there an existing issue / discussion for this?

I have searched the existing issues / discussions

Is there an existing answer for this in FAQ?

I have searched FAQ

Current Behavior

When the service is started with openai.py, calling it raises the error below. How can this be resolved?
raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'Qwen2ForCausalLM' object has no attribute 'chat'

Expected Behavior

The call succeeds.

Steps To Reproduce

Run python openai.py

Environment

- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):

Anything else?

Please take a look, thanks.
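The error above indicates that the script calls `model.chat(...)`, a method that `Qwen2ForCausalLM` in Transformers does not define. With Qwen1.5 the usual replacement is the tokenizer's chat template followed by `generate`. The sketch below wraps those steps in one function; `chat_once` is a hypothetical name, and the function assumes `model` and `tokenizer` were loaded with `AutoModelForCausalLM.from_pretrained` and `AutoTokenizer.from_pretrained`.

```python
def chat_once(model, tokenizer, messages, max_new_tokens=512):
    """Generate one assistant reply for a list of {"role", "content"} messages."""
    # Render the conversation into the model's prompt format.
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
    # Passing the whole encoding also supplies attention_mask.
    generated = model.generate(**model_inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the newly generated part is decoded.
    new_tokens = [
        out[len(inp):] for inp, out in zip(model_inputs["input_ids"], generated)
    ]
    return tokenizer.batch_decode(new_tokens, skip_special_tokens=True)[0]


# Intended use (not executed here):
#   reply = chat_once(model, tokenizer, [{"role": "user", "content": "Hello"}])
```

A server script written for the original Qwen `chat` interface would need its calls replaced along these lines, or swapped for a script written for Qwen1.5.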


Using a ModelScope Jupyter notebook:
!pip install modelscope
import os
os.environ['VLLM_USE_MODELSCOPE'] = 'True'
!pip install vllm
!pip install --upgrade transformers==4.37.2

from vllm import LLM
llm = LLM(model="Qwen/Qwen1.5-0.5B", trust_remote_code=True, gpu_memory_utilization=0.95)  # Create an LLM.
prompts = '叶文杰在三体里的角色'
outputs = llm.generate(prompts)

The Qwen model was downloaded from ModelScope, then the following error occurred:
