qwenlm / qwen1.5
Qwen1.5 is the improved version of Qwen, the large language model series developed by Qwen team, Alibaba Cloud.
Running the API service with openai_api.py fails with an error.
Model used: Qwen1.5-14B-Chat-GPTQ-Int8
python3 openai_api.py --checkpoint-path /root/Qwen1.5-14B-Chat-GPTQ-Int8 --server-port 18000
Traceback (most recent call last):
File "openai_api.py", line 574, in <module>
tokenizer = AutoTokenizer.from_pretrained(
File "/usr/local/lib/python3.8/dist-packages/transformers/models/auto/tokenization_auto.py", line 724, in from_pretrained
raise ValueError(
ValueError: Tokenizer class Qwen2Tokenizer does not exist or is not currently imported.
The error says Qwen2Tokenizer does not exist. How can I fix this? Thanks.
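One likely cause, assuming the installed transformers is older than 4.37.0 (the release that added the Qwen2 classes, including Qwen2Tokenizer), is that the class is simply not there to import. A minimal sketch of a version check follows; the helper names `parse_version` and `supports_qwen2` are hypothetical and exist only for this illustration.

```python
import re

# Qwen2Tokenizer / Qwen2ForCausalLM first shipped in transformers 4.37.0.
MIN_QWEN2_VERSION = (4, 37, 0)


def parse_version(version: str) -> tuple:
    """Turn '4.37.2' or '4.38.0.dev0' into a comparable (major, minor, patch) tuple."""
    numbers = re.findall(r"\d+", version)[:3]
    numbers += ["0"] * (3 - len(numbers))
    return tuple(int(n) for n in numbers)


def supports_qwen2(transformers_version: str) -> bool:
    """True if this transformers version is new enough to know the Qwen2 classes."""
    return parse_version(transformers_version) >= MIN_QWEN2_VERSION


if __name__ == "__main__":
    try:
        import transformers

        version = transformers.__version__
        if supports_qwen2(version):
            print(version, "is new enough for Qwen1.5")
        else:
            print(version, "is too old; try: pip install -U 'transformers>=4.37.0'")
    except ImportError:
        print("transformers is not installed")
```

If the check passes and the error persists, the cause is probably elsewhere (for example, a second Python environment being picked up).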
**deploy platform:** RTX3090-24G
**model:** Qwen1.5-7B-Chat-GPTQ-Int4
bash:
CUDA_VISIBLE_DEVICES=7 python -m vllm.entrypoints.openai.api_server --model qwen/Qwen1.5-7B-Chat-GPTQ-Int4 --served-model-name qwen/Qwen1.5-7B-Chat-GPTQ-Int4 --quantization gptq
error:
ValueError: The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (24320). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.
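The error message itself points at the two knobs. Here is a minimal sketch of capping the context length at what the KV cache can hold, using the `--max-model-len` and `--gpu-memory-utilization` flags that appear in other commands in this thread. The helper `build_vllm_command` is hypothetical, and 24320 is just the figure from the error above; it will change with the GPU and the memory-utilization setting.

```python
import shlex


def build_vllm_command(model, kv_cache_tokens, model_max_len=32768,
                       gpu_memory_utilization=0.9):
    """Build an api_server command whose context length fits the reported KV cache."""
    # Never ask for more context than the KV cache can store.
    max_model_len = min(model_max_len, kv_cache_tokens)
    return [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--quantization", "gptq",
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]


cmd = build_vllm_command("qwen/Qwen1.5-7B-Chat-GPTQ-Int4", kv_cache_tokens=24320)
print(shlex.join(cmd))
```

Lowering the context length trades maximum prompt size for the ability to start at all; raising `--gpu-memory-utilization` instead leaves less headroom for other processes on the same GPU.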
Does vLLM not support Qwen2ForCausalLM yet?
Qwen1.5-72B-Chat uses twice as much GPU memory as Qwen-72B-Chat and delivers only half the performance. The model config shows "torch_dtype": "bfloat16", but the dtype printed after loading is float32.
Model: https://modelscope.cn/models/qwen/Qwen1.5-72B-Chat/files
Config file: https://modelscope.cn/models/qwen/Qwen1.5-72B-Chat/file/view/master/config.json?status=1
{
"architectures": [
"Qwen2ForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151643,
"hidden_act": "silu",
"hidden_size": 8192,
"initializer_range": 0.02,
"intermediate_size": 24576,
"max_position_embeddings": 32768,
"max_window_layers": 70,
"model_type": "qwen2",
"num_attention_heads": 64,
"num_hidden_layers": 80,
"num_key_value_heads": 64,
"rms_norm_eps": 1e-06,
"rope_theta": 1000000.0,
"sliding_window": 32768,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.37.0",
"use_cache": true,
"use_sliding_window": false,
"vocab_size": 152064
}
After the model is loaded, print the actual parameter dtype:
# chat_demo_72B-1.5.py
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import os, time
from tqdm import tqdm

device = "cuda"
model_path = "/data/shared/Qwen1.5/Qwen1.5-72B-Chat"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

# No torch_dtype is passed here, so from_pretrained loads the weights in float32.
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", use_cache=True)

# Sum the memory allocated on every visible GPU.
total_memory_allocated = 0
device_count = torch.cuda.device_count()
for device_id in range(device_count):
    torch.cuda.set_device(device_id)
    total_memory_allocated += torch.cuda.memory_allocated(device_id)
total_memory_allocated_gb = total_memory_allocated / (1024**3)
print(f"\nTotal memory allocated across all visible CUDA devices: {total_memory_allocated_gb} GB\n")

tokenizer = AutoTokenizer.from_pretrained(model_path)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
]

while True:
    user_input = input("\nUser: ")
    if user_input == "quit":
        print("exit")
        break
    messages.append({"role": "user", "content": user_input})
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(device)

    pbar = tqdm(total=1, desc="Generating", leave=True)
    start_time = time.time()
    generated_ids = model.generate(
        model_inputs.input_ids,
        max_new_tokens=512,
        pad_token_id=tokenizer.eos_token_id
    )
    end_time = time.time()
    generation_time = end_time - start_time
    pbar.update(1)
    pbar.close()

    # Keep only the newly generated tokens (drop the prompt prefix).
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    print("\nQwen-Chat: ", response)
    # The model's reply belongs to the "assistant" role, not "system".
    messages.append({"role": "assistant", "content": response})

    # Throughput statistics for this turn.
    total_generated_tokens = sum(len(ids) for ids in generated_ids)
    total_characters = len(response)
    chars_per_second = total_characters / generation_time
    tokens_per_second = total_generated_tokens / generation_time
    print(f"Response length: {len(response)} characters")
    print(f"Response time: {generation_time:.2f}s")
    print(f"Characters per second: {chars_per_second:.2f}")
    print(f"Tokens per second: {tokens_per_second:.2f}")
- OS:Alibaba Cloud Linux release 3 (Soaring Falcon)
- Python:Python 3.8.10
- Transformers:4.37.2
- PyTorch:2.0.1
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):11.7
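A plausible explanation for the doubled memory is the load call rather than the checkpoint. `from_pretrained` defaults to float32 unless `torch_dtype` is given, and float32 takes 4 bytes per parameter against 2 for bfloat16. The sketch below does that arithmetic for a nominal 72e9 parameters (an approximation, not an exact count) and shows loading with `torch_dtype="auto"`, which follows the dtype recorded in config.json. The helper names are hypothetical.

```python
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Memory needed for the weights alone (no activations, no KV cache), in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


def load_in_checkpoint_dtype(model_path: str):
    """Load the model in the dtype stored in its config instead of float32."""
    # Imported lazily so the arithmetic above runs without transformers installed.
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype="auto",  # honour "torch_dtype": "bfloat16" from config.json
        device_map="auto",
    )


NOMINAL_PARAMS = 72e9  # assumption: rough parameter count for a "72B" model
for dtype in ("float32", "bfloat16"):
    print(dtype, round(weight_memory_gib(NOMINAL_PARAMS, dtype), 1), "GiB")
```

Running in float32 also roughly doubles the memory traffic per token, which would be consistent with the lower throughput reported above, though that part is an inference rather than something measured here.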
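With an API/web service deployed through FastChat, both chat and completion requests end early, as if generation had not finished. The same happens when calling completion after apply_chat_template.
With vllm.entrypoints.openai.api_server, the model generates a lot of \n. I tried the methods suggested in the issues and none has worked so far. Using \n directly as a stop string fixes one problem but causes another. qwen/Qwen1.5-14B-Chat does not show this behavior, qwen/Qwen1.5-14B-Chat-GPTQ-Int4 shows it occasionally, qwen/Qwen1.5-72B-Chat-GPTQ-Int4 is severely affected, and 4B-Chat and its GPTQ-Int4 variant are severely affected.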
messages = [{"role": "user", "content": "你是什么模型?"}]
response = openai.ChatCompletion.create(
    model=model_id,
    messages=messages,
    top_k=50,
    top_p=0.7,
    repetition_penalty=1,
    temperature=0.7,
    max_tokens=512,
    stream=True
)
These two terms appear in the README.md:
We have not integrated GQA and mixture of SWA and full attention in this version and we will add the features in the future version.
Wonderful work!
Did you train the model with FlashAttention-2? And how can FlashAttention-2 be enabled?
Have you tested whether FlashAttention-2 degrades model performance?
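On the "how to enable" part only: in recent transformers releases the attention backend is chosen with the `attn_implementation` argument of `from_pretrained`. A minimal sketch, assuming the `flash_attn` package is installed for the FlashAttention-2 path and falling back to PyTorch SDPA otherwise. The helper names are hypothetical, and this says nothing about how the released checkpoints were trained.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return 'flash_attention_2' if flash_attn is importable, else 'sdpa'."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


def load_model(model_path: str):
    """Load a causal LM with the best attention backend available locally."""
    # Imported lazily so the backend check runs without transformers installed.
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype="auto",  # FlashAttention-2 expects fp16/bf16 weights
        attn_implementation=pick_attn_implementation(),
        device_map="auto",
    )


print("attention backend:", pick_attn_implementation())
```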
Traceback (most recent call last):
File "Sakura_DPO_qwen.py", line 339, in <module>
fire.Fire(train)
File "/root/miniconda3/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/root/miniconda3/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/root/miniconda3/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "Sakura_DPO_qwen.py", line 329, in train
dpo_trainer.train()
File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 1539, in train
return inner_training_loop(
File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 1869, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 2772, in training_step
loss = self.compute_loss(model, inputs)
File "/root/miniconda3/lib/python3.8/site-packages/trl/trainer/dpo_trainer.py", line 1055, in compute_loss
loss, metrics = self.get_batch_loss_metrics(model, inputs, train_eval="train")
File "/root/miniconda3/lib/python3.8/site-packages/trl/trainer/dpo_trainer.py", line 996, in get_batch_loss_metrics
) = self.concatenated_forward(model, batch)
File "/root/miniconda3/lib/python3.8/site-packages/trl/trainer/dpo_trainer.py", line 960, in concatenated_forward
all_logits = model(
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1852, in forward
loss = self.module(*inputs, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/peft/peft_model.py", line 1083, in forward
return self.base_model(
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/peft/tuners/tuners_utils.py", line 161, in forward
return self.model.forward(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1173, in forward
outputs = self.model(
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1003, in forward
inputs_embeds = self.embed_tokens(input_ids)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
result = forward_call(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 162, in forward
return F.embedding(
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/functional.py", line 2210, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.HalfTensor instead (while checking arguments for embedding)
Hi, as the title says. Thank you very much!
vLLM
We advise you to use vLLM>=0.3.0 to build OpenAI-compatible API service. Start the server with a chat model, e.g. Qwen1.5-7B-Chat:
python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen1.5-7B-Chat
Then use the chat API as demonstrated below:
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "Qwen/Qwen1.5-7B-Chat",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "你是谁?"}
    ]
    }'
The output ends with many '\n\n\n'.
Adding the parameter "add_generation_prompt": True to the API request solves the problem.
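The workaround reported above can be sketched as a request body with the extra field set. This uses only the standard library; `build_chat_payload` and `post_chat` are hypothetical helper names, and the field name `add_generation_prompt` is taken from the report rather than verified against a particular vLLM release.

```python
import json
import urllib.request


def build_chat_payload(model, messages, add_generation_prompt=True):
    """Chat-completions request body with the extra field from the report above."""
    return {
        "model": model,
        "messages": messages,
        "add_generation_prompt": add_generation_prompt,
    }


def post_chat(base_url, payload):
    """POST the payload to an OpenAI-compatible server and return the parsed JSON."""
    request = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_chat_payload(
    "Qwen/Qwen1.5-7B-Chat",
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who are you?"},
    ],
)
# post_chat("http://localhost:8000", payload)  # needs the server above to be running
```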
As the title says, thanks.
I tried chatting with the Qwen1.5-7B-Chat-GPTQ-Int8 model and ran into the following:
(欢迎使用 Qwen-Chat 模型,输入内容即可进行对话,:h 显示命令帮助。)
Note: This demo is governed by the original license of Qwen.
We strongly advise users not to knowingly generate or allow others to knowingly generate harmful content, including hate speech, violence, pornography, deception, etc.
(注:本演示受Qwen的许可协议限制。我们强烈建议,用户不应传播及不应允许他人传播以下内容,包括但不限于仇恨言论、暴力、色情、欺诈相关的有害信息。)
User> 自我介绍一下?
Traceback (most recent call last):
File "D:\work\AI\HK_Integration\Qwen\cli_demo.py", line 210, in <module>
main()
File "D:\work\AI\HK_Integration\Qwen\cli_demo.py", line 198, in main
for response in model.chat_stream(tokenizer, query, history=history, generation_config=config):
File "D:\software\Anaconda3\envs\HK_Integration_env\lib\site-packages\torch\nn\modules\module.py", line 1614, in __getattr__
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'Qwen2ForCausalLM' object has no attribute 'chat_stream'
With all dependencies installed, run cli_demo.py.
- OS:windows10
- Python:3.8
- Transformers:4.37.0
- PyTorch:2.0.1
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):11.8
WARNING:root:Some parameters are on the meta device device because they were offloaded to the cpu.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:151645 for open-end generation.
AttributeError: 'Qwen2ForCausalLM' object has no attribute 'chat'
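Previously, for convenience and compatibility with other models, I used the chat method, but Qwen1.5 now reports that there is no chat method?
As the title says.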
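Qwen1.5 loads through the stock `Qwen2ForCausalLM` class in transformers, which is why `chat` and `chat_stream` are missing. The pattern used elsewhere in this thread is `apply_chat_template` followed by `generate`. Below is a minimal sketch of a drop-in wrapper, assuming the older convention where `history` is a list of (query, response) pairs and the call returns `(response, history)`. Both function names are hypothetical.

```python
def build_messages(query, history=None, system="You are a helpful assistant."):
    """Convert (query, response) history pairs into chat-template messages."""
    messages = [{"role": "system", "content": system}]
    for user_turn, assistant_turn in history or []:
        messages.append({"role": "user", "content": user_turn})
        messages.append({"role": "assistant", "content": assistant_turn})
    messages.append({"role": "user", "content": query})
    return messages


def chat(model, tokenizer, query, history=None, max_new_tokens=512):
    """Rough stand-in for the removed model.chat(); returns (response, history)."""
    history = list(history or [])
    text = tokenizer.apply_chat_template(
        build_messages(query, history),
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    # Passing **inputs forwards attention_mask along with input_ids.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens and decode only the newly generated part.
    new_ids = output_ids[0][inputs.input_ids.shape[1]:]
    response = tokenizer.decode(new_ids, skip_special_tokens=True)
    return response, history + [(query, response)]
```

Usage would look like `response, history = chat(model, tokenizer, "Hello", history)`. For streaming, a streamer object such as transformers' `TextIteratorStreamer` can be passed to `generate` in place of the removed `chat_stream`.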
When I input a long prompt to the model, it gets stuck in the generation step. The GPU usage increases briefly and then falls back, but the memory occupied by the tokens is not released, and it just stays stuck there. According to the code, it seems to be stuck at this step:
generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
This problem occurs when the text length reaches about 400~500 tokens. Short texts (under 200 tokens) are generated normally. I reproduced this problem on both the official AWQ quantized model and my own fine-tuned model.
2024-02-07 06:36:08,182 - modelscope - INFO - PyTorch version 1.12.1 Found.
2024-02-07 06:36:08,183 - modelscope - INFO - Loading ast index from /root/.cache/modelscope/ast_indexer
2024-02-07 06:36:08,221 - modelscope - INFO - Loading done! Current index file version is 1.12.0, with md5 155281490ae47dfe7d8f4ba91b079bfc and a total number of 964 components indexed
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:151645 for open-end generation.
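In hands-on testing, Qwen1.5 really does perform very well, haha. I quickly put together an updated hands-on tutorial for Qwen1.5, so everyone can happily grind through the Spring Festival holiday!
Please update the Docker image as well, thanks.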
ms-swift 1.5.1 requires transformers<4.37,>=4.33, but you have transformers 4.37.2 which is incompatible.
Successfully installed transformers-4.37.2
KeyError Traceback (most recent call last)
Cell In[21], line 2
1 from vllm import LLM
----> 2 llm = LLM(model="Qwen/Qwen1.5-0.5B",trust_remote_code=True,gpu_memory_utilization=0.95) # Create an LLM.
3 prompts='叶文杰在三体里的角色'
4 outputs = llm.generate(prompts) # Generate texts from the prompts.
File /opt/conda/lib/python3.10/site-packages/vllm/entrypoints/llm.py:105, in LLM.__init__(self, model, tokenizer, tokenizer_mode, trust_remote_code, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, enforce_eager, max_context_len_to_capture, **kwargs)
87 kwargs["disable_log_stats"] = True
88 engine_args = EngineArgs(
89 model=model,
90 tokenizer=tokenizer,
(...)
103 **kwargs,
104 )
--> 105 self.llm_engine = LLMEngine.from_engine_args(engine_args)
106 self.request_counter = Counter()
File /opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py:304, in LLMEngine.from_engine_args(cls, engine_args)
302 """Creates an LLM engine from the engine arguments."""
303 # Create the engine configs.
--> 304 engine_configs = engine_args.create_engine_configs()
305 parallel_config = engine_configs[2]
306 # Initialize the cluster.
File /opt/conda/lib/python3.10/site-packages/vllm/engine/arg_utils.py:218, in EngineArgs.create_engine_configs(self)
215 def create_engine_configs(
216 self,
217 ) -> Tuple[ModelConfig, CacheConfig, ParallelConfig, SchedulerConfig]:
--> 218 model_config = ModelConfig(self.model, self.tokenizer,
219 self.tokenizer_mode, self.trust_remote_code,
220 self.download_dir, self.load_format,
221 self.dtype, self.seed, self.revision,
222 self.tokenizer_revision, self.max_model_len,
223 self.quantization, self.enforce_eager,
224 self.max_context_len_to_capture)
225 cache_config = CacheConfig(self.block_size,
226 self.gpu_memory_utilization,
227 self.swap_space,
228 model_config.get_sliding_window())
229 parallel_config = ParallelConfig(self.pipeline_parallel_size,
230 self.tensor_parallel_size,
231 self.worker_use_ray,
232 self.max_parallel_loading_workers)
File /opt/conda/lib/python3.10/site-packages/vllm/config.py:101, in ModelConfig.__init__(self, model, tokenizer, tokenizer_mode, trust_remote_code, download_dir, load_format, dtype, seed, revision, tokenizer_revision, max_model_len, quantization, enforce_eager, max_context_len_to_capture)
98 self.download_dir = model_path
99 self.tokenizer = model_path
--> 101 self.hf_config = get_config(self.model, trust_remote_code, revision)
102 self.dtype = _get_and_verify_dtype(self.hf_config, dtype)
103 self.max_model_len = _get_and_verify_max_len(self.hf_config,
104 max_model_len)
File /opt/conda/lib/python3.10/site-packages/vllm/transformers_utils/config.py:23, in get_config(model, trust_remote_code, revision)
19 def get_config(model: str,
20 trust_remote_code: bool,
21 revision: Optional[str] = None) -> PretrainedConfig:
22 try:
---> 23 config = AutoConfig.from_pretrained(
24 model, trust_remote_code=trust_remote_code, revision=revision)
25 except ValueError as e:
26 if (not trust_remote_code and
27 "requires you to execute the configuration file" in str(e)):
File /opt/conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py:1098, in AutoConfig.from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
1096 return config_class.from_pretrained(pretrained_model_name_or_path, **kwargs)
1097 elif "model_type" in config_dict:
-> 1098 config_class = CONFIG_MAPPING[config_dict["model_type"]]
1099 return config_class.from_dict(config_dict, **unused_kwargs)
1100 else:
1101 # Fallback: use pattern matching on the string.
1102 # We go from longer names to shorter names to catch roberta before bert (for instance)
File /opt/conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py:795, in _LazyConfigMapping.__getitem__(self, key)
793 return self._extra_content[key]
794 if key not in self._mapping:
--> 795 raise KeyError(key)
796 value = self._mapping[key]
797 module_name = model_type_to_module_name(key)
KeyError: 'qwen2'
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01 Driver Version: 470.103.01 CUDA Version: 12.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A10 On | 00000000:00:07.0 Off | 0 |
| 0% 30C P8 16W / 150W | 0MiB / 22731MiB | 0% Default |
When deploying Qwen/Qwen1.5-0.5B-Chat-GPTQ-Int4 with vLLM v0.3.0, calling it via cURL as shown in README.md results in the following error.
If stream=true is used, there is no error, but the output is a series of "!" tokens. This looks like a vocabulary-related error during decoding.
It's worth mentioning that the error only occurs in the 0.5B version; for other GPTQ Int4 versions like 1.8B, 4B, and 7B, this error does NOT occur.
Exception in callback functools.partial(<function _raise_exception_on_finish at 0x7f5e8e5380d0>, request_tracker=<vllm.engine.async_llm_engine.RequestTracker object at 0x7f5e8bdd2e90>)
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7f5e8e5380d0>, request_tracker=<vllm.engine.async_llm_engine.RequestTracker object at 0x7f5e8bdd2e90>)>
Traceback (most recent call last):
File "/workspace/vllm/engine/async_llm_engine.py", line 29, in _raise_exception_on_finish
task.result()
File "/workspace/vllm/engine/async_llm_engine.py", line 409, in run_engine_loop
has_requests_in_progress = await self.engine_step()
File "/workspace/vllm/engine/async_llm_engine.py", line 388, in engine_step
request_outputs = await self.engine.step_async()
File "/workspace/vllm/engine/async_llm_engine.py", line 203, in step_async
return self._process_model_outputs(output, scheduler_outputs)
File "/workspace/vllm/engine/llm_engine.py", line 715, in _process_model_outputs
self._process_sequence_group_outputs(seq_group, outputs)
File "/workspace/vllm/engine/llm_engine.py", line 586, in _process_sequence_group_outputs
self._decode_sequence(seq, seq_group.sampling_params)
File "/workspace/vllm/engine/llm_engine.py", line 891, in _decode_sequence
read_offset) = detokenize_incrementally(
File "/workspace/vllm/transformers_utils/tokenizer.py", line 221, in detokenize_incrementally
new_text = tokenizer.convert_tokens_to_string(
File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_fast.py", line 612, in convert_tokens_to_string
return self.backend_tokenizer.decoder.decode(tokens)
TypeError: argument 'tokens': 'NoneType' object cannot be converted to 'PyString'
As the title says.
When you run finetune/finetune_lora_single_gpu.sh with model Qwen/Qwen1.5-0.5B-Chat:
You get the error:
Traceback (most recent call last):
File "/Qwentuning/Qwen/finetune.py", line 374, in <module>
train()
File "/Qwentuning/Qwen/finetune.py", line 328, in train
tokenizer.pad_token_id = tokenizer.eod_id
AttributeError: 'Qwen2Tokenizer' object has no attribute 'eod_id'
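The failing line reads `eod_id`, an attribute of the original Qwen tokenizer that `Qwen2Tokenizer` does not define. As a hedged workaround (not a confirmed fix from the maintainers), the padding id could be resolved with a fallback chain. `resolve_pad_token_id` is a hypothetical helper, and the `SimpleNamespace` objects below are stand-ins for real tokenizers.

```python
from types import SimpleNamespace


def resolve_pad_token_id(tokenizer):
    """Pick a padding id: legacy eod_id first, then pad_token_id, then eos_token_id."""
    for attr in ("eod_id", "pad_token_id", "eos_token_id"):
        value = getattr(tokenizer, attr, None)
        if value is not None:  # explicit None check so that an id of 0 still counts
            return value
    raise ValueError("tokenizer exposes no usable padding token id")


# Stand-ins for illustration only; real code would pass the loaded tokenizer.
old_style = SimpleNamespace(eod_id=151643)
new_style = SimpleNamespace(pad_token_id=None, eos_token_id=151645)

print(resolve_pad_token_id(old_style), resolve_pad_token_id(new_style))
```

In finetune.py the assignment would then become `tokenizer.pad_token_id = resolve_pad_token_id(tokenizer)`. Since that script was written for the original Qwen tokenizer, other parts of it may need similar changes.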
Expected: the model should train.
pip install transformers peft deepspeed datasets
In finetune/finetune_lora_single_gpu.sh: change the model name to Qwen/Qwen1.5-0.5B-Chat
Run finetune/finetune_lora_single_gpu.sh
- OS: Linux
- Python: 3.10
- Transformers: 4.37.2
- PyTorch: 2.0
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`): 11.7
The model loads much more slowly with the new code than with the earlier custom modeling code.
This is an example project to test how to use padding-free training and Multipack Sampler from openchat, achieving a 3~10x speedup compared to the conventional padded training.
Hi, I am running Qwen1.5 72B with vLLM on 4 A100 80G GPUs using this command:
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen1.5-72B-Chat --enforce-eager --tensor-parallel-size 4 --gpu-memory-utilization 1.0 --max-model-len 4096
Do I need to specify a template? Currently vllm tells me that it is using
INFO 02-12 14:59:44 serving_chat.py:260] Using default chat template:
INFO 02-12 14:59:44 serving_chat.py:260] {% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system
INFO 02-12 14:59:44 serving_chat.py:260] You are a helpful assistant<|im_end|>
INFO 02-12 14:59:44 serving_chat.py:260] ' }}{% endif %}{{'<|im_start|>' + message['role'] + '
INFO 02-12 14:59:44 serving_chat.py:260] ' + message['content']}}{% if (loop.last and add_generation_prompt) or not loop.last %}{{ '<|im_end|>' + '
INFO 02-12 14:59:44 serving_chat.py:260] '}}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %}{{ '<|im_start|>assistant
INFO 02-12 14:59:44 serving_chat.py:260] ' }}{% endif %}
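To see what the logged template produces, here is a plain-Python sketch that mirrors its Jinja logic branch by branch. It is an illustration of the template shown in the log above, not vLLM's implementation; `render_chatml` is a hypothetical name, and the guard for an empty message list is an addition.

```python
DEFAULT_SYSTEM = "<|im_start|>system\nYou are a helpful assistant<|im_end|>\n"


def render_chatml(messages, add_generation_prompt=True):
    """Render messages the way the logged default chat template would."""
    parts = []
    # {% if loop.first and messages[0]['role'] != 'system' %}: inject a default system turn.
    if messages and messages[0]["role"] != "system":
        parts.append(DEFAULT_SYSTEM)
    last = len(messages) - 1
    for index, message in enumerate(messages):
        parts.append("<|im_start|>" + message["role"] + "\n" + message["content"])
        # Every turn is closed, except the last one when no generation prompt is requested.
        if index != last or add_generation_prompt:
            parts.append("<|im_end|>\n")
    # Open the assistant turn so the model starts answering.
    if add_generation_prompt and messages and messages[-1]["role"] != "assistant":
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


print(render_chatml([{"role": "user", "content": "Hi"}]))
```

Judging from the log, the server has already picked up a ChatML-style template, so the rendered prompt should match the format above as long as requests go through the chat endpoint.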
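A question: compared with the unquantized 7B/14B models, how does the quantized 72B model perform? Is the quantized 72B much worse, or are they simply not comparable?
The tokenizer should be interchangeable with 1.0, right?
I reused my earlier deployment code for Qwen-chat-14b to deploy Qwen1.5-chat-14b with vLLM, and kept the same request format in my RAG pipeline.
Many answers became very long, and some even contained strings like "\n\n\n\n\n" or unrelated content.
From experience, either the prompt or the stop words are not configured correctly.
Have the prompt format, stop words, etc. of Qwen1.5 changed compared with the previous generation?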
url = "/v1/chat/completions"
data = {
    # "model": "/gpdata/ideal/download/llama-factory/qwen1.5-0.5B-Chat",
    "model": "qwen",
    # "model": "/gpdata/ideal/moneymarket/Qwen-7B-Chat/",
    # "model": "/gpdata/ideal/moneymarket/ChatGLM3-6B-32k/",
    "messages": openai_bot.messages_base + [
        {
            "role": "user",
            "content": prompt
        }
    ],
    "temperature": 0,
    "top_p": 1,
    "max_length": 100,
    "stream": False,
    # "stop": ["<|im_start|>", "<|im_end|>"],
    # "stop": ["151645","151643"]
    "add_generation_prompt": True,
}
Fine-tuning parameters:
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --do_train True \
    --model_name_or_path /gpdata/ideal/download/qwen1.5-0.5B-Chat \
    --finetuning_type full \
    --template qwen \
    --dataset_dir data \
    --dataset moneymarket \
    --cutoff_len 1024 \
    --learning_rate 5e-05 \
    --num_train_epochs 3.0 \
    --max_samples 100000 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 5 \
    --save_steps 100 \
    --warmup_steps 0 \
    --lora_rank 8 \
    --lora_dropout 0.1 \
    --lora_target q_proj,v_proj \
    --output_dir saves/Qwen1.5-0.5B-Chat/full/train_2024-02-18-16-37-04 \
    --fp16 True \
    --val_size 0.1 \
    --evaluation_strategy steps \
    --eval_steps 100 \
    --load_best_model_at_end True \
    --plot_loss True
For Qwen1.5-7B-Chat-GPTQ-Int4, "disable_exllma": true has to be added alongside "exllama_config" under "quantization_config" in config.json to avoid an error:
{
"architectures": [
"Qwen2ForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151643,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 11008,
"max_position_embeddings": 32768,
"max_window_layers": 28,
"model_type": "qwen2",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 32,
"quantization_config": {
"batch_size": 1,
"bits": 4,
"block_name_to_quantize": null,
"cache_block_outputs": true,
"damp_percent": 0.01,
"dataset": null,
"desc_act": false,
"exllama_config": {
"version": 1
},
"group_size": 128,
"max_input_length": null,
"model_seqlen": null,
"module_name_preceding_first_block": null,
"modules_in_block_to_quantize": null,
"pad_token_id": null,
"quant_method": "gptq",
"sym": true,
"tokenizer": null,
"true_sequential": true,
"use_cuda_fp16": false,
"use_exllama": true,
"disable_exllma": true
},
"rms_norm_eps": 1e-06,
"rope_theta": 1000000.0,
"sliding_window": 32768,
"tie_word_embeddings": false,
"torch_dtype": "float16",
"transformers_version": "4.37.0",
"use_cache": true,
"use_sliding_window": false,
"vocab_size": 151936
}
Exllama2 currently has the best quantization strategy and the fastest inference speed except tensorrt-llm. In addition, its memory usage during inference is also minimal.
Can this efficient inference backend be supported?
python3 -m vllm.entrypoints.openai.api_server --model /root/Qwen1.5-14B-Chat
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.8/dist-packages/vllm/entrypoints/openai/api_server.py", line 25, in <module>
from vllm.entrypoints.openai.serving_completion import OpenAIServingCompletion
File "/usr/local/lib/python3.8/dist-packages/vllm/entrypoints/openai/serving_completion.py", line 22, in <module>
TypeTokenIDs = list[int]
TypeError: 'type' object is not subscriptable
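The traceback runs under Python 3.8, where built-in types such as `list` cannot be subscripted at runtime; `list[int]` only works from Python 3.9 onward (PEP 585). A minimal sketch of the difference follows. `supports_builtin_generics` is a hypothetical helper, and upgrading the interpreter is the practical fix, since the failing line lives inside vLLM itself.

```python
import sys
from typing import List

# Works on every supported Python 3 version, including 3.8.
TokenIDs = List[int]


def supports_builtin_generics(version_info=sys.version_info) -> bool:
    """True if expressions like list[int] are valid at runtime (Python >= 3.9)."""
    return tuple(version_info[:2]) >= (3, 9)


if supports_builtin_generics():
    print("list[int] is available on this interpreter")
else:
    print("Python < 3.9: list[int] raises TypeError; use typing.List[int] or upgrade")
```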
copy issue, ollama/ollama#2405
After running the example program from https://github.com/QwenLM/Qwen1.5#-hugging-face-transformers, an exception occurs:
(qwen1.5) root@test:/home/test/qwen1.5# python chat.py
Traceback (most recent call last):
File "/root/anaconda3/envs/qwen1.5/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 1364, in _get_module
return importlib.import_module("." + module_name, self.name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/qwen1.5/lib/python3.11/importlib/init.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "", line 1204, in _gcd_import
File "", line 1176, in _find_and_load
File "", line 1147, in _find_and_load_unlocked
File "", line 690, in _load_unlocked
File "", line 940, in exec_module
File "", line 241, in _call_with_frames_removed
File "/root/anaconda3/envs/qwen1.5/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 49, in
from flash_attn import flash_attn_func, flash_attn_varlen_func
File "/root/anaconda3/envs/qwen1.5/lib/python3.11/site-packages/flash_attn/init.py", line 3, in
from flash_attn.flash_attn_interface import (
File "/root/anaconda3/envs/qwen1.5/lib/python3.11/site-packages/flash_attn/flash_attn_interface.py", line 10, in
import flash_attn_2_cuda as flash_attn_cuda
ImportError: /root/anaconda3/envs/qwen1.5/lib/python3.11/site-packages/flash_attn_2_cuda.cpython-311-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalINS2_10ScalarTypeEEENS6_INS2_6LayoutEEENS6_INS2_6DeviceEEENS6_IbEE
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/test/qwen1.5/chat.py", line 4, in
model = AutoModelForCausalLM.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/qwen1.5/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 565, in from_pretrained
model_class = _get_model_class(config, cls._model_mapping)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/qwen1.5/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 387, in _get_model_class
supported_models = model_mapping[type(config)]
~~~~~~~~~~~~~^^^^^^^^^^^^^^
File "/root/anaconda3/envs/qwen1.5/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 740, in getitem
return self._load_attr_from_module(model_type, model_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/qwen1.5/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 754, in _load_attr_from_module
return getattribute_from_module(self._modules[module_name], attr)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/qwen1.5/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 698, in getattribute_from_module
if hasattr(module, attr):
^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/qwen1.5/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 1354, in getattr
module = self._get_module(self._class_to_module[name])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/qwen1.5/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 1366, in _get_module
raise RuntimeError(
RuntimeError: Failed to import transformers.models.qwen2.modeling_qwen2 because of the following error (look up to see its traceback):
/root/anaconda3/envs/qwen1.5/lib/python3.11/site-packages/flash_attn_2_cuda.cpython-311-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalINS2_10ScalarTypeEEENS6_INS2_6LayoutEEENS6_INS2_6DeviceEEENS6_IbEE
(qwen1.5) root@test:/home/test/qwen1.5#
(qwen1.5) root@test:/home/test/qwen1.5# pip list
Package Version
accelerate 0.24.1
altair 5.2.0
attrs 23.2.0
blinker 1.7.0
cachetools 5.3.2
certifi 2024.2.2
charset-normalizer 3.3.2
click 8.1.7
einops 0.7.0
filelock 3.13.1
flash-attn 2.5.0
fsspec 2024.2.0
gitdb 4.0.11
GitPython 3.1.41
huggingface-hub 0.20.3
idna 3.6
importlib-metadata 6.11.0
Jinja2 3.1.3
jsonschema 4.21.1
jsonschema-specifications 2023.12.1
markdown-it-py 3.0.0
MarkupSafe 2.1.5
mdurl 0.1.2
mpmath 1.3.0
networkx 3.2.1
ninja 1.11.1.1
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.19.3
nvidia-nvjitlink-cu12 12.3.101
nvidia-nvtx-cu12 12.1.105
packaging 23.2
pandas 2.2.0
Pillow 9.5.0
pip 23.3.1
protobuf 4.25.2
psutil 5.9.8
pyarrow 15.0.0
pydeck 0.8.1b0
Pygments 2.17.2
Pympler 1.0.1
python-dateutil 2.8.2
pytz 2024.1
pytz-deprecation-shim 0.1.0.post0
PyYAML 6.0.1
referencing 0.33.0
regex 2023.12.25
requests 2.31.0
rich 13.7.0
rpds-py 0.17.1
safetensors 0.4.2
setuptools 68.2.2
six 1.16.0
smmap 5.0.1
some-package 0.1
streamlit 1.24.0
sympy 1.12
tenacity 8.2.3
tokenizers 0.15.1
toml 0.10.2
toolz 0.12.1
torch 2.2.0
torchaudio 2.2.0
torchvision 0.17.0
tornado 6.4
tqdm 4.66.1
transformers 4.37.2
transformers-stream-generator 0.0.4
triton 2.2.0
typing_extensions 4.9.0
tzdata 2023.4
tzlocal 4.3.1
urllib3 2.2.0
validators 0.22.0
watchdog 4.0.0
wheel 0.41.2
zipp 3.17.0
(qwen1.5) root@test:/home/test/qwen1.5# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
Wed Feb 7 14:39:58 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:01:00.0 Off | Off |
| 34% 55C P2 344W / 450W | 698MiB / 24564MiB | 94% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
Hi, what is the correct way to get the vocab size from the tokenizer in HF?
I tried the following:
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-7B-Chat-GPTQ-Int8")
tokenizer.vocab_size and len(tokenizer) give 151643 and 151646 whereas the model outputs 151936
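The three numbers measure different things. In Hugging Face tokenizers, `vocab_size` is the base vocabulary without added tokens, and `len(tokenizer)` includes added special tokens. The model's output dimension comes from `vocab_size` in config.json (151936 in the config quoted earlier in this thread). The sketch below only does the bookkeeping. Treating the extra embedding rows as unused padding is an assumption, and `describe_vocab` is a hypothetical helper.

```python
def describe_vocab(base_vocab_size, tokenizer_len, embedding_rows):
    """Break the three reported sizes down into added tokens and spare rows."""
    return {
        "added_tokens": tokenizer_len - base_vocab_size,
        "unused_embedding_rows": embedding_rows - tokenizer_len,
    }


# Numbers taken from the question above.
info = describe_vocab(base_vocab_size=151643, tokenizer_len=151646, embedding_rows=151936)
print(info)
```

For sizing anything that must match the logits (for example a classification head or a logit mask), the config value is the one to use; token ids at or above `len(tokenizer)` have no corresponding token to decode.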
SWIFT is a light-weighted training framework of ModelScope community. You can visit our github website:
https://github.com/modelscope/swift
When running openai.py, requests fail with the error below. How can this be solved?
raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'Qwen2ForCausalLM' object has no attribute 'chat'
Expected: the call works normally.
Run python openai.py
Please take a look, thanks.
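The current web UI does not support Qwen v1.5. Could you provide a version that supports v1.5?
Does deploying version 1.5 still support the previous openai_api, or must vLLM or SGLang be used now? Also, roughly how much GPU memory does the 72B Int4 model need, and what input length can it handle? I previously downloaded the Int8 model and could deploy it on an 80 GB A100, but it ran out of GPU memory once the input exceeded one or two hundred tokens.
Testing with qwen1.5-14b-chat, I found that neither of these two LangChain agents can call tools. Does Qwen1.5 support these two LangChain agents? If so, how should they be used?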