
glm-4's Introduction

GLM-4

📄 Report • 🤗 HF Repo • 🤖 ModelScope • 🟣 WiseModel • 🐦 Twitter • 👋 Join our Discord and WeChat

📍 Experience and use larger-scale commercial GLM models on the Zhipu AI Open Platform.

Read this in English

Project Updates

  • 🔥🔥 News: 2024/07/24: We released the latest technical discussion on long context. See here for the technical report on the long-context techniques used in training the open-source GLM-4-9B model.
  • 🔥 News: 2024/7/16: The transformers version that GLM-4-9B-Chat depends on has been upgraded to 4.42.4. Please update the model configuration files and update your dependencies according to basic_demo/requirements.txt.
  • 🔥 News: 2024/7/9: The GLM-4-9B-Chat model has been adapted to Ollama and Llama.cpp; see the PR for details.
  • 🔥 News: 2024/7/1: We updated fine-tuning for GLM-4V-9B. You need to update the run and configuration files in our model repository to support this feature; for more fine-tuning details (e.g. dataset format, VRAM requirements), see here.
  • 🔥 News: 2024/6/28: Working with the Intel technical team, we improved the ITREX and OpenVINO deployment tutorials for GLM-4-9B-Chat. You can efficiently deploy the open-source GLM-4-9B model on Intel CPU/GPU devices. See here.
  • 🔥 News: 2024/6/24: We updated the run and configuration files in the model repository to support Flash Attention 2. Please update the model configuration files and refer to the sample code in basic_demo/trans_cli_demo.py.
  • 🔥 News: 2024/6/19: We updated the run and configuration files in the model repository and fixed some known model-inference issues. Please clone the latest model repository.
  • 🔥 News: 2024/6/18: We released the technical report; please take a look.
  • 🔥 News: 2024/6/05: We released the GLM-4-9B series of open-source models.

Model Introduction

GLM-4-9B is the open-source version of the latest generation of pre-trained models in the GLM-4 series launched by Zhipu AI. On dataset evaluations covering semantics, mathematics, reasoning, code, and knowledge, GLM-4-9B and its human-preference-aligned version GLM-4-9B-Chat both show performance beyond Llama-3-8B. Besides multi-turn dialogue, GLM-4-9B-Chat also offers advanced features such as web browsing, code execution, custom tool calling (Function Call), and long-text reasoning (supporting up to 128K context). This generation adds multilingual support for 26 languages, including Japanese, Korean, and German. We have also released GLM-4-9B-Chat-1M, which supports a 1M context length (about 2 million Chinese characters), and GLM-4V-9B, a multimodal model based on GLM-4-9B. GLM-4V-9B supports Chinese-English bilingual multi-turn dialogue at a resolution of 1120 * 1120, and in multimodal evaluations covering comprehensive Chinese and English abilities, perceptual reasoning, text recognition, and chart understanding, GLM-4V-9B outperforms GPT-4-turbo-2024-04-09, Gemini 1.0 Pro, Qwen-VL-Max, and Claude 3 Opus.

Model List

Model            Type  Seq Length  Download                                       Online Demo
GLM-4-9B         Base  8K          🤗 Huggingface  🤖 ModelScope  🟣 WiseModel   /
GLM-4-9B-Chat    Chat  128K        🤗 Huggingface  🤖 ModelScope  🟣 WiseModel   🤖 ModelScope CPU / 🤖 ModelScope vLLM
GLM-4-9B-Chat-1M Chat  1M          🤗 Huggingface  🤖 ModelScope  🟣 WiseModel   /
GLM-4V-9B        Chat  8K          🤗 Huggingface  🤖 ModelScope  🟣 WiseModel   🤖 ModelScope

Evaluation Results

Typical tasks of the chat model

Model AlignBench MT-Bench IFEval MMLU C-Eval GSM8K MATH HumanEval NaturalCodeBench
Llama-3-8B-Instruct 6.40 8.00 68.6 68.4 51.3 79.6 30.0 62.2 24.7
ChatGLM3-6B 5.18 5.50 28.1 61.4 69.0 72.3 25.7 58.5 11.3
GLM-4-9B-Chat 7.01 8.35 69.0 72.4 75.6 79.6 50.6 71.8 32.2

Typical tasks of the base model

Model MMLU C-Eval GPQA GSM8K MATH HumanEval
Llama-3-8B 66.6 51.2 - 45.8 - 33.5
Llama-3-8B-Instruct 68.4 51.3 34.2 79.6 30.0 62.2
ChatGLM3-6B-Base 61.4 69.0 26.8 72.3 25.7 58.5
GLM-4-9B 74.7 77.1 34.3 84.0 30.4 70.1

Because some mathematics, reasoning, and code-related instruction data was added during GLM-4-9B's pre-training, Llama-3-8B-Instruct is also included in the comparison.

Long Context

The needle-in-a-haystack experiment was run at a context length of 1M; the results are as follows:

[Figure: needle-in-a-haystack evaluation results]

Long-text capability was further evaluated on LongBench-Chat; the results are as follows:

[Figure: LongBench-Chat evaluation results]

Multilingual Capability

GLM-4-9B-Chat and Llama-3-8B-Instruct were tested on six multilingual datasets; the results and the languages selected for each dataset are shown in the table below.

Dataset Llama-3-8B-Instruct GLM-4-9B-Chat Languages
M-MMLU 49.6 56.6 all
FLORES 25.0 28.8 ru, es, de, fr, it, pt, pl, ja, nl, ar, tr, cs, vi, fa, hu, el, ro, sv, uk, fi, ko, da, bg, no
MGSM 54.0 65.3 zh, en, bn, de, es, fr, ja, ru, sw, te, th
XWinograd 61.7 73.1 zh, en, fr, jp, ru, pt
XStoryCloze 84.7 90.7 zh, en, ar, es, eu, hi, id, my, ru, sw, te
XCOPA 73.3 80.1 zh, et, ht, id, it, qu, sw, ta, th, tr, vi

Tool Calling Capability

We evaluated on the Berkeley Function Calling Leaderboard and obtained the following results:

Model Overall Acc. AST Summary Exec Summary Relevance
Llama-3-8B-Instruct 58.88 59.25 70.01 45.83
gpt-4-turbo-2024-04-09 81.24 82.14 78.61 88.75
ChatGLM3-6B 57.88 62.18 69.78 5.42
GLM-4-9B-Chat 81.00 80.26 84.40 87.92

Multimodal Capability

GLM-4V-9B is a multimodal language model with visual understanding; its evaluation results on classic related tasks are as follows:

Model MMBench-EN-Test MMBench-CN-Test SEEDBench_IMG MMStar MMMU MME HallusionBench AI2D OCRBench
gpt-4o-2024-05-13 83.4 82.1 77.1 63.9 69.2 2310.3 55.0 84.6 736
gpt-4-turbo-2024-04-09 81.0 80.2 73.0 56.0 61.7 2070.2 43.9 78.6 656
gpt-4-1106-preview 77.0 74.4 72.3 49.7 53.8 1771.5 46.5 75.9 516
InternVL-Chat-V1.5 82.3 80.7 75.2 57.1 46.8 2189.6 47.4 80.6 720
LLaVA-Next-Yi-34B 81.1 79.0 75.7 51.6 48.8 2050.2 34.8 78.9 574
Step-1V 80.7 79.9 70.3 50.0 49.9 2206.4 48.4 79.2 625
MiniCPM-Llama3-V2.5 77.6 73.8 72.3 51.8 45.8 2024.6 42.4 78.4 725
Qwen-VL-Max 77.6 75.7 72.7 49.5 52.0 2281.7 41.2 75.7 684
Gemini 1.0 Pro 73.6 74.3 70.7 38.6 49.0 2148.9 45.7 72.9 680
Claude 3 Opus 63.3 59.2 64.0 45.7 54.9 1586.8 37.8 70.6 694
GLM-4V-9B 81.1 79.4 76.8 58.7 47.2 2163.8 46.6 81.1 786

Quick Start

For hardware configuration and system requirements, see here.

Use the following methods to quickly call the GLM-4-9B-Chat language model.

Inference with the transformers backend:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"

tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat", trust_remote_code=True)

query = "你好"

inputs = tokenizer.apply_chat_template([{"role": "user", "content": query}],
                                       add_generation_prompt=True,
                                       tokenize=True,
                                       return_tensors="pt",
                                       return_dict=True
                                       )

inputs = inputs.to(device)
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/glm-4-9b-chat",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).to(device).eval()

gen_kwargs = {"max_length": 2500, "do_sample": True, "top_k": 1}
with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Inference with the vLLM backend:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# GLM-4-9B-Chat-1M
# max_model_len, tp_size = 1048576, 4
# If you run into OOM, reduce max_model_len or increase tp_size
max_model_len, tp_size = 131072, 1
model_name = "THUDM/glm-4-9b-chat"
prompt = [{"role": "user", "content": "你好"}]

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,
    # For GLM-4-9B-Chat-1M: if you run into OOM, enable the parameters below
    # enable_chunked_prefill=True,
    # max_num_batched_tokens=8192
)
stop_token_ids = [151329, 151336, 151338]
sampling_params = SamplingParams(temperature=0.95, max_tokens=1024, stop_token_ids=stop_token_ids)

inputs = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
outputs = llm.generate(prompts=inputs, sampling_params=sampling_params)

print(outputs[0].outputs[0].text)

Use the following method to quickly call the GLM-4V-9B multimodal model.

Inference with the transformers backend:

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"

tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4v-9b", trust_remote_code=True)

query = '描述这张图片'
image = Image.open("your image").convert('RGB')
inputs = tokenizer.apply_chat_template([{"role": "user", "image": image, "content": query}],
                                       add_generation_prompt=True, tokenize=True, return_tensors="pt",
                                       return_dict=True)  # chat mode

inputs = inputs.to(device)
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/glm-4v-9b",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).to(device).eval()

gen_kwargs = {"max_length": 2500, "do_sample": True, "top_k": 1}
with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(outputs[0]))

Note: GLM-4V-9B does not yet support being called via vLLM.

Complete Project List

If you want to learn more about the GLM-4-9B series of open-source models, this repository provides developers with basic usage and development code for GLM-4-9B through the following:

  • basic_demo: contains

    • Interaction code using the transformers and vLLM backends
    • OpenAI API backend interaction code
    • Batch inference code
  • composite_demo: contains

    • A full-featured demo of the GLM-4-9B-Chat and GLM-4V-9B open-source models, including demonstrations of the All Tools capability, long-document interpretation, and multimodal abilities.
  • finetune_demo: contains

    • PEFT (LoRA, P-Tuning) fine-tuning code
    • SFT fine-tuning code

Friendly Links

  • LLaMA-Factory: an efficient open-source fine-tuning framework that already supports fine-tuning the GLM-4-9B-Chat language model.
  • SWIFT: the LLM / multimodal-LLM training framework from the ModelScope community, which already supports fine-tuning GLM-4-9B-Chat / GLM-4V-9B.
  • Xorbits Inference: a powerful and comprehensive distributed inference framework for deploying your own models or built-in state-of-the-art open-source models with one click.
  • LangChain-ChatChat: RAG and Agent applications based on LangChain and language models such as ChatGLM.
  • self-llm: tutorials from the Datawhale team on using the GLM-4-9B series models.
  • chatglm.cpp: a llama.cpp-style quantized accelerated inference solution enabling real-time chat on a laptop.

License

  • Use of the GLM-4 model weights must follow the Model License.

  • The code in this open-source repository is released under the Apache 2.0 license.

Please strictly follow the open-source licenses.

Citation

If you find our work helpful, please consider citing the following papers.

@misc{glm2024chatglm,
      title={ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools}, 
      author={Team GLM and Aohan Zeng and Bin Xu and Bowen Wang and Chenhui Zhang and Da Yin and Diego Rojas and Guanyu Feng and Hanlin Zhao and Hanyu Lai and Hao Yu and Hongning Wang and Jiadai Sun and Jiajie Zhang and Jiale Cheng and Jiayi Gui and Jie Tang and Jing Zhang and Juanzi Li and Lei Zhao and Lindong Wu and Lucen Zhong and Mingdao Liu and Minlie Huang and Peng Zhang and Qinkai Zheng and Rui Lu and Shuaiqi Duan and Shudan Zhang and Shulin Cao and Shuxun Yang and Weng Lam Tam and Wenyi Zhao and Xiao Liu and Xiao Xia and Xiaohan Zhang and Xiaotao Gu and Xin Lv and Xinghan Liu and Xinyi Liu and Xinyue Yang and Xixuan Song and Xunkai Zhang and Yifan An and Yifan Xu and Yilin Niu and Yuantao Yang and Yueyan Li and Yushi Bai and Yuxiao Dong and Zehan Qi and Zhaoyu Wang and Zhen Yang and Zhengxiao Du and Zhenyu Hou and Zihan Wang},
      year={2024},
      eprint={2406.12793},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
@misc{wang2023cogvlm,
      title={CogVLM: Visual Expert for Pretrained Language Models}, 
      author={Weihan Wang and Qingsong Lv and Wenmeng Yu and Wenyi Hong and Ji Qi and Yan Wang and Junhui Ji and Zhuoyi Yang and Lei Zhao and Xixuan Song and Jiazheng Xu and Bin Xu and Juanzi Li and Yuxiao Dong and Ming Ding and Jie Tang},
      year={2023},
      eprint={2311.03079},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}


glm-4's Issues

python trans_web_demo.py run on multiple GPUs raises: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument tensors in method wrapper_CUDA_cat)

System Info / 系統信息

PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
NVIDIA-SMI 535.171.04 Driver Version: 535.171.04 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 2080 Ti Off | 00000000:01:00.0 Off | N/A |
| 16% 36C P8 11W / 250W | 8378MiB / 11264MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce GTX 1070 Off | 00000000:02:00.0 Off | N/A |
| 0% 32C P8 9W / 230W | 7892MiB / 8192MiB | 0% Default |
| | | N/A |

Who can help? / 谁可以帮助到您?

No response

Information / 问题信息

  • The official example scripts / 官方的示例脚本
  • My own modified scripts / 我自己修改的脚本和任务

Reproduction / 复现过程

Running python trans_web_demo.py on multiple GPUs, an error is raised when a question is sent to the model: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument tensors in method wrapper_CUDA_cat)

Expected behavior / 期待表现

python trans_web_demo.py should produce output normally when a question is asked.

TypeError: Fraction.__new__() got an unexpected keyword argument '_normalize'

System Info / 系統信息

ubutu 22.04
cuda 12.1
python 3.12.3
pytorch 2.3.0
transformers 4.40.0

Who can help? / 谁可以帮助到您?

No response

Information / 问题信息

  • The official example scripts / 官方的示例脚本
  • My own modified scripts / 我自己修改的脚本和任务

Reproduction / 复现过程

The fine-tuning dataset and fine-tuning script were built following the official code; the run command is as follows:

python finetune.py  data/  /data/share/rwq/glm-4-9b-chat  configs/lora_hb.yaml

Output:


│ /home/powerop/work/conda/envs/glm4/lib/python3.12/site-packages/transformers │
│ /trainer.py:3719 in evaluation_loop                                          │
│                                                                              │
│   3716 │   │   │   │   │   EvalPrediction(predictions=all_preds, label_ids=a │
│   3717 │   │   │   │   )                                                     │
│   3718 │   │   │   else:                                                     │
│ ❱ 3719 │   │   │   │   metrics = self.compute_metrics(EvalPrediction(predict │
│   3720 │   │   else:                                                         │
│   3721 │   │   │   metrics = {}                                              │
│   3722                                                                       │
│                                                                              │
│ /home/powerop/work/rwq/GLM-4/finetune_demo/finetune.py:333 in                │
│ compute_metrics                                                              │
│                                                                              │
│   330 │   │   for k, v in scores[0].items():                                 │
│   331 │   │   │   metrics_dct[k].append(round(v['f'] * 100, 4))              │
│   332 │   │   metrics_dct['bleu-4'].append(                                  │
│ ❱ 333 │   │   │   sentence_bleu([label_tokens], pred_tokens, smoothing_funct │
│   334 │   return {k: np.mean(v) for k, v in metrics_dct.items()}             │
│   335                                                                        │
│   336                                                                        │
│                                                                              │
│ /home/powerop/work/conda/envs/glm4/lib/python3.12/site-packages/nltk/transla │
│ te/bleu_score.py:107 in sentence_bleu                                        │
│                                                                              │
│   104 │   :return: The sentence-level BLEU score. Returns a list if multiple │
│   105 │   :rtype: float / list(float)                                        │
│   106 │   """                                                                │
│ ❱ 107 │   return corpus_bleu(                                                │
│   108 │   │   [references], [hypothesis], weights, smoothing_function, auto_ │
│   109 │   )                                                                  │
│   110                                                                        │
│                                                                              │
│ /home/powerop/work/conda/envs/glm4/lib/python3.12/site-packages/nltk/transla │
│ te/bleu_score.py:210 in corpus_bleu                                          │
│                                                                              │
│   207 │   │   # For each order of ngram, calculate the numerator and         │
│   208 │   │   # denominator for the corpus-level modified precision.         │
│   209 │   │   for i in range(1, max_weight_length + 1):                      │
│ ❱ 210 │   │   │   p_i = modified_precision(references, hypothesis, i)        │
│   211 │   │   │   p_numerators[i] += p_i.numerator                           │
│   212 │   │   │   p_denominators[i] += p_i.denominator                       │
│   213                                                                        │
│                                                                              │
│ /home/powerop/work/conda/envs/glm4/lib/python3.12/site-packages/nltk/transla │
│ te/bleu_score.py:368 in modified_precision                                   │
│                                                                              │
│   365 │   # Usually this happens when the ngram order is > len(reference).   │
│   366 │   denominator = max(1, sum(counts.values()))                         │
│   367 │                                                                      │
│ ❱ 368 │   return Fraction(numerator, denominator, _normalize=False)          │
│   369                                                                        │
│   370                                                                        │
│   371 def closest_ref_length(references, hyp_len):                           │
╰──────────────────────────────────────────────────────────────────────────────╯
TypeError: Fraction.__new__() got an unexpected keyword argument '_normalize'
 17%|█▋        | 500/3000 [06:29<32:27,  1.28it/s]



Expected behavior / 期待表现

The training should complete successfully without errors.

Could a version with Ollama support be released?

Feature request / 功能建议

Could a version with Ollama support be released?

Motivation / 动机

Could a version with Ollama support be released?

Your contribution / 您的贡献

Could a version with Ollama support be released?

Suggestion: improve the openai_api_server code so that function_calling responses are consistent with the OpenAI format

Feature request / 功能建议

The result returned by GLM is:

"choices": [
  {
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "query_train_info\n{"departure": "上海", "destination": "北京", "date": "2021-08-23"}",
      "name": null,
      "function_call": null
    },
    "finish_reason": "stop"
  }
]

=================================================
Whereas the same request sent to https://open.bigmodel.cn/api/paas/v4/chat/completions returns:

"choices": [
  {
    "finish_reason": "tool_calls",
    "index": 0,
    "message": {
      "role": "assistant",
      "tool_calls": [
        {
          "function": {
            "arguments": "{"date":"明天","departure":"北京","destination":"上海"}",
            "name": "query_train_info"
          },
          "id": "call_8718084940088338127",
          "index": 0,
          "type": "function"
        }
      ]
    }
  }
]

Motivation / 动机

Your contribution / 您的贡献

ms-swift now supports fine-tuning of the glm-4v-9b multimodal model 🚀😊

The ms-swift multi-modal large model fine-tuning framework integrates the inference and fine-tuning of glm-4v-9b, and provides best practice documentation: https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/glm4v%E6%9C%80%E4%BD%B3%E5%AE%9E%E8%B7%B5.md

If you are interested, feel free to give it a try! 😊

For multimodal processing, which is better: GLM4 or CogVLM2?

I see that GLM4's benchmark scores seem higher, but isn't CogVLM2 a 19B-parameter model? They were open-sourced only a few days apart, so why does a model with roughly half the parameters get better benchmark results? For scenarios dedicated to image processing, which one should I choose?

Could an INT4-quantized version be provided?

I noticed that the README in base mentions VRAM usage and generation-speed tests for both BF16 and INT4 precision, but currently only the BF16 model is provided. Will an official INT4 version be released in the future?
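Until official INT4 weights are available, one stopgap (purely a sketch on my part, not an official release; quality and speed will differ from a purpose-built INT4 checkpoint) is to quantize the BF16 weights on the fly with transformers and bitsandbytes:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Hypothetical on-the-fly 4-bit quantization of the public BF16 checkpoint.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/glm-4-9b-chat",
    quantization_config=quant_config,
    trust_remote_code=True,
    device_map="auto",
).eval()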

Common Information

Fine-tuning discussion thread:
#40
Configuration discussion thread:
#41
Official WeChat group:
[Image: WeChat group QR code]

This repository does not provide a script for merging a fine-tuned model and serving it with vLLM; please configure this yourself following the vLLM documentation.
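As a rough sketch of one possible workflow (an assumption on my part, not a script from this repository; the paths are placeholders), a LoRA adapter produced by finetune_demo can be merged with PEFT and the merged directory then passed to vLLM:

from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

adapter_dir = "output/checkpoint-xxx"        # hypothetical LoRA checkpoint directory
merged_dir = "output/glm-4-9b-chat-merged"   # where the merged weights will be written

# Load base model + adapter, fold the LoRA weights into the base weights, and save.
model = AutoPeftModelForCausalLM.from_pretrained(adapter_dir, trust_remote_code=True)
model = model.merge_and_unload()
model.save_pretrained(merged_dir)

# Save the base tokenizer next to the merged weights so vLLM can find it.
tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat", trust_remote_code=True)
tokenizer.save_pretrained(merged_dir)

merged_dir can then be used in place of "THUDM/glm-4-9b-chat" in the vLLM example from the quick start above.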

Launching vLLM from the CLI still reports an OOM error

System Info / 系統信息

A30 24G
docker-compose file:

version: '3.8'
services:
  vllm_server_aiorder:
    image: jjck_vllm:20240523
    runtime: nvidia
    ipc: host
    restart: always
    ports:
      - "8100:8000"
    volumes:
      - "/etc/localtime:/etc/localtime:ro"
      - "/home///**/glm-4-9b-chat:/aioder/llm_model"
    command: --model /aioder/llm_model --trust-remote-codec
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']
              capabilities: [gpu]

Who can help? / 谁可以帮助到您?

1

Information / 问题信息

  • The official example scripts / 官方的示例脚本
  • My own modified scripts / 我自己修改的脚本和任务

Reproduction / 复现过程

At startup the container received the arguments correctly:

vllm_server_aiorder-vllm_server_aiorder-1 | INFO 06-05 14:36:55 llm_engine.py:103] Initializing an LLM engine (v0.4.2) with config: model='/aioder/llm_model', speculative_config=None, tokenizer='/aioder/llm_model', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/aioder/llm_model)
Startup process:
vllm_server_aiorder-vllm_server_aiorder-1 | Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
vllm_server_aiorder-vllm_server_aiorder-1 | WARNING 06-05 14:36:55 tokenizer.py:126] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
vllm_server_aiorder-vllm_server_aiorder-1 | INFO 06-05 14:37:01 model_runner.py:146] Loading model weights took 17.5635 GB
It still errors out in the end:
vllm_server_aiorder-vllm_server_aiorder-1 | [rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.69 GiB. GPU

Expected behavior / 期待表现

Can you tell me what the problem is?
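A hedged reading of the log above: the weights alone take about 17.6 GB of the A30's 24 GB, and max_seq_len defaults to 131072, so little memory is left for the KV cache. In line with the OOM hint in the README's vLLM example (reduce max_model_len), one adjustment to try is changing only the compose command line, using vLLM's --max-model-len flag (8192 here is just an example value):

command: --model /aioder/llm_model --trust-remote-code --max-model-len 8192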

trans_web_demo.py reports an inference GPU error

System Info / 系統信息

transformers 4.40.0
torch 2.3.0

Who can help? / 谁可以帮助到您?

No response

Information / 问题信息

  • The official example scripts / 官方的示例脚本
  • My own modified scripts / 我自己修改的脚本和任务

Reproduction / 复现过程

Running trans_web_demo.py raises the following error at modeling_chatglm.py line 627.

Exception has occurred: RuntimeError
Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:2! (when checking argument for argument tensors in method wrapper_CUDA_cat)
File "/root/.cache/huggingface/modules/transformers_modules/glm-4-9b-chat/modeling_chatglm.py", line 627, in forward
presents = torch.cat((presents, kv_cache), dim=0)
File "/root/.cache/huggingface/modules/transformers_modules/glm-4-9b-chat/modeling_chatglm.py", line 777, in forward
hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder(
File "/root/.cache/huggingface/modules/transformers_modules/glm-4-9b-chat/modeling_chatglm.py", line 881, in forward
transformer_outputs = self.transformer(
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:2! (when checking argument for argument tensors in method wrapper_CUDA_cat)

Expected behavior / 期待表现

...

LLaMA Factory + GLM4 fine-tuning best practices

LLaMA Factory supports instruction fine-tuning, RLHF, DPO, SimPO, and other optimization methods for the GLM-4-9B and GLM-4-9B-Chat models.

https://github.com/hiyouga/LLaMA-Factory/blob/main/README_zh.md

Instruction fine-tuning

CUDA_VISIBLE_DEVICES=0,1 HF_ENDPOINT=https://hf-mirror.com llamafactory-cli train sft.yaml

Contents of sft.yaml:

### model
model_name_or_path: THUDM/glm-4-9b-chat

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all

### dataset
dataset: identity,alpaca_en_demo,alpaca_zh_demo
template: glm4
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/glm4-sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
fp16: true

### eval
val_size: 0.1
per_device_eval_batch_size: 1
evaluation_strategy: steps
eval_steps: 500

[Image: SFT training screenshot]

Multi-GPU inference

CUDA_VISIBLE_DEVICES=0,1 HF_ENDPOINT=https://hf-mirror.com llamafactory-cli chat \
    --model_name_or_path THUDM/glm-4-9b-chat \
    --adapter_name_or_path saves/glm4-sft \
    --template glm4 \
    --finetuning_type lora

Resource usage

LoRA: ~20 GB
QLoRA: ~10 GB
Half-precision inference: ~18 GB
4-bit inference: ~7 GB

[Image: VRAM usage screenshot]

GLM-4V-9B support for llama.cpp

Feature request / 功能建议

Hoping GLM-4V-9B can be supported by llama.cpp.

Motivation / 动机

With llama.cpp support it could be called from Ollama, which needs relatively few compute resources and is convenient to use; support would be appreciated.

Your contribution / 您的贡献

vllm_cli_demo reports an error

System Info / 系統信息

python3.10, cuda2.2, 24 GB VRAM

Who can help? / 谁可以帮助到您?

No response

Information / 问题信息

  • The official example scripts / 官方的示例脚本
  • My own modified scripts / 我自己修改的脚本和任务

Reproduction / 复现过程

vllm_cli_demo.py file:

engine_args = AsyncEngineArgs(
    model=model_dir,
    tokenizer=model_dir,
    tensor_parallel_size=1,
    dtype="bfloat16",
    trust_remote_code=True,
    gpu_memory_utilization=0.3,
    enforce_eager=True,
    worker_use_ray=True,
    engine_use_ray=False,
    disable_log_requests=True
    # If you run into OOM, enable the parameters below
    # enable_chunked_prefill=True,
    # max_num_batched_tokens=8192
)

With the configuration above, the error torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB. GPU is reported.

Following the official hint, the parameters above were adjusted to:

engine_args = AsyncEngineArgs(
    model=model_dir,
    tokenizer=model_dir,
    tensor_parallel_size=1,
    dtype="bfloat16",
    trust_remote_code=True,
    gpu_memory_utilization=0.3,
    enforce_eager=True,
    worker_use_ray=True,
    engine_use_ray=False,
    disable_log_requests=True,
    # If you run into OOM, enable the parameters below
    enable_chunked_prefill=True,
    max_num_batched_tokens=8192
)

which then produces the following error:
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-05 21:11:03 config.py:676] Chunked prefill is enabled (EXPERIMENTAL).
2024-06-05 21:11:05,275 INFO worker.py:1749 -- Started a local Ray instance.
INFO 06-05 21:11:05 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='/joyue/model/glm-4-9b-chat', speculative_config=None, tokenizer='/joyue/model/glm-4-9b-chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/joyue/model/glm-4-9b-chat)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 06-05 21:11:05 tokenizer.py:126] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO 06-05 21:11:08 model_runner.py:146] Loading model weights took 17.5635 GB
INFO 06-05 21:11:10 distributed_gpu_executor.py:56] # GPU blocks: 0, # CPU blocks: 6553
ERROR 06-05 21:11:10 worker_base.py:148] Error executing method initialize_cache. This might cause deadlock in distributed execution.
ERROR 06-05 21:11:10 worker_base.py:148] Traceback (most recent call last):
ERROR 06-05 21:11:10 worker_base.py:148] File "/home/ps/miniconda3/envs/glm4/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 140, in execute_method
ERROR 06-05 21:11:10 worker_base.py:148] return executor(*args, **kwargs)
ERROR 06-05 21:11:10 worker_base.py:148] File "/home/ps/miniconda3/envs/glm4/lib/python3.10/site-packages/vllm/worker/worker.py", line 187, in initialize_cache
ERROR 06-05 21:11:10 worker_base.py:148] raise_if_cache_size_invalid(num_gpu_blocks,
ERROR 06-05 21:11:10 worker_base.py:148] File "/home/ps/miniconda3/envs/glm4/lib/python3.10/site-packages/vllm/worker/worker.py", line 370, in raise_if_cache_size_invalid
ERROR 06-05 21:11:10 worker_base.py:148] raise ValueError("No available memory for the cache blocks. "
ERROR 06-05 21:11:10 worker_base.py:148] ValueError: No available memory for the cache blocks. Try increasing gpu_memory_utilization when initializing the engine.
[rank0]: Traceback (most recent call last):
[rank0]: File "/joyue/work/pythonwork/GLM-4/basic_demo/vllm_cli_demo.py", line 46, in
[rank0]: engine, tokenizer = load_model_and_tokenizer(MODEL_PATH)
[rank0]: File "/joyue/work/pythonwork/GLM-4/basic_demo/vllm_cli_demo.py", line 42, in load_model_and_tokenizer
[rank0]: engine = AsyncLLMEngine.from_engine_args(engine_args)
[rank0]: File "/home/ps/miniconda3/envs/glm4/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 386, in from_engine_args
[rank0]: engine = cls(
[rank0]: File "/home/ps/miniconda3/envs/glm4/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 340, in init
[rank0]: self.engine = self._init_engine(*args, **kwargs)
[rank0]: File "/home/ps/miniconda3/envs/glm4/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 462, in _init_engine
[rank0]: return engine_class(*args, **kwargs)
[rank0]: File "/home/ps/miniconda3/envs/glm4/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 235, in init
[rank0]: self._initialize_kv_caches()
[rank0]: File "/home/ps/miniconda3/envs/glm4/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 325, in _initialize_kv_caches
[rank0]: self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]: File "/home/ps/miniconda3/envs/glm4/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 62, in initialize_cache
[rank0]: self._run_workers("initialize_cache",
[rank0]: File "/home/ps/miniconda3/envs/glm4/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 246, in _run_workers
[rank0]: driver_worker_output = self.driver_worker.execute_method(
[rank0]: File "/home/ps/miniconda3/envs/glm4/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 149, in execute_method
[rank0]: raise e
[rank0]: File "/home/ps/miniconda3/envs/glm4/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 140, in execute_method
[rank0]: return executor(*args, **kwargs)
[rank0]: File "/home/ps/miniconda3/envs/glm4/lib/python3.10/site-packages/vllm/worker/worker.py", line 187, in initialize_cache
[rank0]: raise_if_cache_size_invalid(num_gpu_blocks,
[rank0]: File "/home/ps/miniconda3/envs/glm4/lib/python3.10/site-packages/vllm/worker/worker.py", line 370, in raise_if_cache_size_invalid
[rank0]: raise ValueError("No available memory for the cache blocks. "
[rank0]: ValueError: No available memory for the cache blocks. Try increasing gpu_memory_utilization when initializing the engine

Expected behavior / 期待表现

How can this be resolved?

Errors when using the fine-tuned model, with both inference.py and vllm.py

System Info / 系統信息

  • cuda 12.4
  • python 3.10
  • transformers 4.41.2

Who can help? / 谁可以帮助到您?

ValueError: Unrecognized configuration class <class 'transformers_modules.checkpoint-8607.configuration_chatglm.ChatGLMConfig'> to build an AutoTokenizer.

Information / 问题信息

  • The official example scripts / 官方的示例脚本
  • My own modified scripts / 我自己修改的脚本和任务

Reproduction / 复现过程

python inference.py glm-sft/squeeze_prompt/model/checkpoint-8607

Expected behavior / 期待表现

inference and vLLM testing should work normally.
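One guess at a workaround (my own assumption, not a confirmed fix): if the checkpoint directory is missing the tokenizer files, the tokenizer can be loaded from the original model directory while the weights come from the checkpoint:

from transformers import AutoModelForCausalLM, AutoTokenizer

base_dir = "/data/share/rwq/glm-4-9b-chat"                 # original model path from this issue
ckpt_dir = "glm-sft/squeeze_prompt/model/checkpoint-8607"  # fine-tuned checkpoint

# Tokenizer from the base model, weights from the fine-tuned checkpoint.
tokenizer = AutoTokenizer.from_pretrained(base_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(ckpt_dir, trust_remote_code=True)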

Running the multimodal model with composite_demo raises RuntimeError: view size is not compatible with input tensor's size....

System Info / 系統信息

  • OS: Ubuntu 20
  • Hardware: Nvidia V100
  • CUDA version: 11.8
  • Python version: 3.10
  • Dependencies: pip install -r requirements.txt

Information / 问题信息

  • The official example scripts / 官方的示例脚本
  • My own modified scripts / 我自己修改的脚本和任务

Reproduction / 复现过程

conda create -n glm4.1 python=3.10
conda activate glm4.1
cd ~/ZhipuAI/GLM-4/composite_demo
pip install -r requirements.txt
# Download glm-4v-9b from ModelScope to a local directory, then start the local model with the following command
streamlit run  src/main.py

Expected behavior / 期待表现

In multimodal mode, the following error logs appear when an image is uploaded for recognition.

Error log 1:
Exception in thread Thread-18 (generate):
Traceback (most recent call last):
File "/home/vipuser/anaconda3/envs/glm4.1/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/home/vipuser/anaconda3/envs/glm4.1/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/home/vipuser/anaconda3/envs/glm4.1/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/vipuser/anaconda3/envs/glm4.1/lib/python3.10/site-packages/transformers/generation/utils.py", line 1622, in generate
result = self._sample(
File "/home/vipuser/anaconda3/envs/glm4.1/lib/python3.10/site-packages/transformers/generation/utils.py", line 2791, in _sample
outputs = self(
File "/home/vipuser/anaconda3/envs/glm4.1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/vipuser/anaconda3/envs/glm4.1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/glm-4v-9b/modeling_chatglm.py", line 1017, in forward
transformer_outputs = self.transformer(
File "/home/vipuser/anaconda3/envs/glm4.1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/vipuser/anaconda3/envs/glm4.1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/glm-4v-9b/modeling_chatglm.py", line 855, in forward
images_features = self.vision(images)
File "/home/vipuser/anaconda3/envs/glm4.1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/vipuser/anaconda3/envs/glm4.1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/glm-4v-9b/visual.py", line 157, in forward
x = self.transformer(x)
File "/home/vipuser/anaconda3/envs/glm4.1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/vipuser/anaconda3/envs/glm4.1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/glm-4v-9b/visual.py", line 120, in forward
hidden_states = layer_module(hidden_states)
File "/home/vipuser/anaconda3/envs/glm4.1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/vipuser/anaconda3/envs/glm4.1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/glm-4v-9b/visual.py", line 105, in forward
attention_output = self.input_layernorm(self.attention(attention_input))
File "/home/vipuser/anaconda3/envs/glm4.1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/vipuser/anaconda3/envs/glm4.1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/glm-4v-9b/visual.py", line 69, in forward
output = self.dense(out.transpose(1, 2).view(B, L, -1))
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.

Error log 2:
Traceback (most recent call last):
File "/root/ZhipuAI/GLM-4/composite_demo/src/main.py", line 287, in main
for response, chat_history in client.generate_stream(
File "/root/ZhipuAI/GLM-4/composite_demo/src/clients/hf.py", line 57, in generate_stream
for token_text in streamer:
File "/home/vipuser/anaconda3/envs/glm4.1/lib/python3.10/site-packages/transformers/generation/streamers.py", line 223, in next
value = self.text_queue.get(timeout=self.timeout)
File "/home/vipuser/anaconda3/envs/glm4.1/lib/python3.10/queue.py", line 179, in get
raise Empty
_queue.Empty

The low_cpu_mem_usage=True parameter may be problematic to use

System Info / 系統信息

Windows 10, e5-2090v4, 64 GB RAM, 2080 Ti 22 GB GPU
Virtual memory is disabled on the system.

Who can help? / 谁可以帮助到您?

No response

Information / 问题信息

  • The official example scripts / 官方的示例脚本
  • My own modified scripts / 我自己修改的脚本和任务

Reproduction / 复现过程

With low_cpu_mem_usage=True:
When the system was already using 21 GB of RAM (42 GB free), the model could not start. (The Windows system log reported: "Windows successfully diagnosed a low virtual memory condition.")
After closing some applications so that RAM usage dropped to 15 GB, it started normally.
After commenting out the low_cpu_mem_usage=True line,
it started normally even with 21 GB of RAM already in use.

Expected behavior / 期待表现

This parameter is not in the GLM-3 documentation; it appears to be new.
From its name it should reduce CPU and memory usage, but the result is a failure to start.
I am not sure whether this is specific to my setup, or whether the parameter has usage constraints, e.g. it first needs to occupy more memory for some other optimization.
Because the failure happens at startup with no hint at all on the command line, it is hard for beginners to troubleshoot.
If this is a usage problem with the parameter, I suggest removing it from the documentation.

VRAM usage of the models during inference

Feature request / 功能建议

What is the VRAM usage of each GLM-4 model during inference?

Motivation / 动机

Estimating compute requirements.

Your contribution / 您的贡献

None for now.

How to launch using two specified GPUs

System Info / 系統信息

How can I launch using two specified GPUs?

Who can help? / 谁可以帮助到您?

How can I launch using two specified GPUs?

Information / 问题信息

  • The official example scripts / 官方的示例脚本
  • My own modified scripts / 我自己修改的脚本和任务

Reproduction / 复现过程

How can I launch using two specified GPUs?

Expected behavior / 期待表现

How can I launch using two specified GPUs?
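A minimal sketch (assuming the transformers quick start above, not any specific demo script): restrict the visible devices and let accelerate shard the model across both GPUs.

# Launch with: CUDA_VISIBLE_DEVICES=0,1 python your_script.py   (script name is a placeholder)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/glm-4-9b-chat",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",  # shard layers across the two visible GPUs
).eval()

For the vLLM backend, the equivalent knob is tensor_parallel_size=2 in the LLM(...) call from the quick start.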

Any forward plan to support llama.cpp?

Feature request / 功能建议

Llama.cpp provides a simple way to deploy models on server and desktop environment. Looking forward to llama.cpp support.

Motivation / 动机

Your contribution / 您的贡献

Are the basic_demo dependencies written incorrectly?

System Info / 系統信息

OS: CentOS 7.9
Conda 13.11.0
Python 3.12

Who can help? / 谁可以帮助到您?

No response

Information / 问题信息

  • The official example scripts / 官方的示例脚本
  • My own modified scripts / 我自己修改的脚本和任务

Reproduction / 复现过程

Installing the basic_demo dependencies fails because bitsandbytes>=0.43.1 cannot be found.

The highest available version is 0.42.0.

Expected behavior / 期待表现

Dependencies should install normally.

Hello, following the provided fine-tuning code, the observation part appears to have loss computed during tool-call fine-tuning. Is it certain that loss should not be computed for it, or did I get the computation wrong?

System Info / 系統信息

ubuntu 22.04

Who can help? / 谁可以帮助到您?

Everyone

Information / 问题信息

  • The official example scripts / 官方的示例脚本
  • My own modified scripts / 我自己修改的脚本和任务

Reproduction / 复现过程

message: {'role': 'system', 'content': '', 'tools': [{'type': 'function', 'function': {'name': 'get_recommended_books', 'description': "Get recommended books based on user's interests", 'parameters': {'type': 'object', 'properties': {'interests': {'type': 'array', 'items': {'type': 'string'}, 'description': 'The interests to recommend books for'}}, 'required': ['interests']}}}]}
message: {'role': 'user', 'content': 'Hi, I am looking for some book recommendations. I am interested in history and science fiction.'}
message: {'role': 'assistant', 'content': '{"name": "get_recommended_books", "arguments": {"interests": ["history", "science fiction"]}}'}
message: {'role': 'observation', 'content': '{"books": ["Sapiens: A Brief History of Humankind by Yuval Noah Harari", "A Brief History of Time by Stephen Hawking", "Dune by Frank Herbert", "The Martian by Andy Weir"]}'}
message: {'role': 'assistant', 'content': 'Based on your interests in history and science fiction, I would recommend the following books: "Sapiens: A Brief History of Humankind" by Yuval Noah Harari, "A Brief History of Time" by Stephen Hawking, "Dune" by Frank Herbert, and "The Martian" by Andy Weir.'}
input_ids: [[151331, 151333, 151335, 198, 98406, 99950, 103092, 5588, 44, 12, 19, 220, 121245, 99941, 113255, 1773, 103408, 100698, 99126, 100789, 15457, 100597, 109331, 100484, 5588, 44, 12, 19, 6567, 44143, 98604, 110230, 3837, 99444, 126744, 100293, 99833, 100000, 121189, 99089, 108419, 113574, 111426, 3407, 565, 633, 1288, 35692, 72520, 271, 515, 262, 330, 606, 788, 330, 455, 1288, 35692, 72520, 756, 262, 330, 4684, 788, 330, 1949, 11097, 6467, 3118, 389, 1196, 594, 11766, 756, 262, 330, 13777, 788, 341, 286, 330, 1313, 788, 330, 1700, 756, 286, 330, 13185, 788, 341, 310, 330, 12718, 82, 788, 341, 394, 330, 1313, 788, 330, 1653, 756, 394, 330, 3615, 788, 341, 503, 330, 1313, 788, 330, 917, 698, 394, 1153, 394, 330, 4684, 788, 330, 785, 11766, 311, 6934, 6467, 369, 698, 310, 456, 286, 1153, 286, 330, 6279, 788, 2278, 310, 330, 12718, 82, 698, 286, 5133, 262, 456, 532, 98319, 110462, 101409, 102481, 98335, 3837, 98964, 98991, 8307, 118633, 98522, 99424, 98619, 100791, 100955, 1773, 151336, 198, 13041, 11, 358, 1079, 3330, 369, 1045, 2311, 18532, 13, 358, 1079, 8013, 304, 3840, 323, 8037, 16970, 13, 151337, 198, 4913, 606, 788, 330, 455, 1288, 35692, 72520, 497, 330, 16356, 788, 5212, 12718, 82, 788, 4383, 18810, 497, 330, 39342, 16970, 1341, 3417, 151338, 198, 4913, 12104, 788, 4383, 50, 2068, 724, 25, 362, 36333, 11094, 315, 19820, 68759, 553, 27184, 831, 41747, 5227, 2780, 497, 330, 32, 36333, 11094, 315, 4120, 553, 18067, 12605, 10561, 497, 330, 35, 2886, 553, 9267, 56976, 497, 330, 785, 80371, 553, 24756, 1205, 404, 91669, 151337, 198, 28613, 389, 697, 11766, 304, 3840, 323, 8037, 16970, 11, 358, 1035, 6934, 279, 2701, 6467, 25, 330, 50, 2068, 724, 25, 362, 36333, 11094, 315, 19820, 68759, 1, 553, 27184, 831, 41747, 5227, 2780, 11, 330, 32, 36333, 11094, 315, 4120, 1, 553, 18067, 12605, 10561, 11, 330, 35, 2886, 1, 553, 9267, 56976, 11, 323, 330, 785, 80371, 1, 553, 24756, 1205, 404, 13, 151329]]
labels: [[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 198, 4913, 606, 788, 330, 455, 1288, 35692, 72520, 497, 330, 16356, 788, 5212, 12718, 82, 788, 4383, 18810, 497, 330, 39342, 16970, 1341, 3417, 151338, 198, 4913, 12104, 788, 4383, 50, 2068, 724, 25, 362, 36333, 11094, 315, 19820, 68759, 553, 27184, 831, 41747, 5227, 2780, 497, 330, 32, 36333, 11094, 315, 4120, 553, 18067, 12605, 10561, 497, 330, 35, 2886, 553, 9267, 56976, 497, 330, 785, 80371, 553, 24756, 1205, 404, 91669, 151337, 198, 28613, 389, 697, 11766, 304, 3840, 323, 8037, 16970, 11, 358, 1035, 6934, 279, 2701, 6467, 25, 330, 50, 2068, 724, 25, 362, 36333, 11094, 315, 19820, 68759, 1, 553, 27184, 831, 41747, 5227, 2780, 11, 330, 32, 36333, 11094, 315, 4120, 1, 553, 18067, 12605, 10561, 11, 330, 35, 2886, 1, 553, 9267, 56976, 11, 323, 330, 785, 80371, 1, 553, 24756, 1205, 404, 13, 151329]]

Expected behavior / 期待表现

Hoping for a prompt answer.

Formatted output for function calls

Feature request / 功能建议

When using transformers inference for function calls,

messages = [
    {
        "role": "system",
        "content": system_prompt,
        "tools": [{"type": "function", "function": i} for i in skylark_func_list],
    },
    {
        "role": "user",
        "content": query
    }
]

inputs = tokenizer.apply_chat_template(messages,
                                       add_generation_prompt=True,
                                       tokenize=True,
                                       return_tensors="pt",
                                       return_dict=True,
                                       )
inputs = inputs.to(device)

how can the output be guaranteed to be formatted, similar to the examples below, so that the function name and extracted arguments can be read from a specific field?
[Images: example of the desired formatted function-call output]
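A minimal sketch of post-processing the raw completion (assuming the model emits the tool name on the first line followed by a JSON argument object, as in the openai_api_server issue above; this is not an official parsing utility):

import json

def parse_tool_call(text: str):
    # Split a raw GLM-4 tool-call completion into (function_name, arguments_dict).
    # Returns (None, None) if the text does not look like a tool call.
    text = text.strip()
    name, sep, rest = text.partition("\n")
    if not sep:
        return None, None
    try:
        args = json.loads(rest.strip())
    except json.JSONDecodeError:
        return None, None
    return name.strip(), args

# Example with the output shape shown earlier in this thread:
name, args = parse_tool_call('query_train_info\n{"departure": "上海", "destination": "北京", "date": "2021-08-23"}')
print(name, args)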

Motivation / 动机

Formatted output for function calls

Your contribution / 您的贡献

tokenization_chatglm.py error

System Info / 系統信息

tokenizer.decode raises: TypeError: token should only be of type types or str

The cause is that keys in the GLM-4 vocabulary are stored as bytes, and a bytes object, when iterated over in transformers' _decode function, yields int values.

Who can help? / 谁可以帮助到您?

No response

Information / 问题信息

  • The official example scripts / 官方的示例脚本
  • My own modified scripts / 我自己修改的脚本和任务

Reproduction / 复现过程

Running the following code reproduces the error:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ZhipuAI/glm-4-9b-chat", trust_remote_code=True)
new_str = tokenizer.decode(198)
print(new_str)

Expected behavior / 期待表现

Modifying the convert_tokens_to_string function in tokenization_chatglm.py as follows resolves the problem:

def convert_tokens_to_string(self, tokens: List[Union[bytes, str, int]]) -> str:
    """
    Converts a sequence of tokens in a single string.
    """
    text = ""
    temp = b""
    for t in tokens:
        if isinstance(t, int):
            t = chr(t)
        if isinstance(t, str):
            if temp:
                text += temp.decode("utf-8", errors="replace")
                temp = b""
            text += t
        elif isinstance(t, bytes):
            temp += t
        else:
            raise TypeError("token should only be of type int, bytes or str")
    if temp:
        text += temp.decode("utf-8", errors="replace")
    return text

A PR has been submitted to the glm-4-9b-chat and glm-4-9b-chat-1m model repositories on Hugging Face; I hope it is adopted. Thanks for open-sourcing!

openai_api_server.py error

System Info / 系統信息

python=3.9,vllm=0.4.0+cu118 torch=2.1.2+cu118

Who can help? / 谁可以帮助到您?

Information / 问题信息

  • The official example scripts / 官方的示例脚本
  • My own modified scripts / 我自己修改的脚本和任务

Reproduction / 复现过程

Traceback (most recent call last):
File "/home/pai/envs/glm4/lib/python3.9/site-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/home/pai/envs/glm4/lib/python3.9/site-packages/uvicorn/middleware/proxy_headers.py", line 69, in call
return await self.app(scope, receive, send)
File "/home/pai/envs/glm4/lib/python3.9/site-packages/fastapi/applications.py", line 1054, in call
await super().call(scope, receive, send)
File "/home/pai/envs/glm4/lib/python3.9/site-packages/starlette/applications.py", line 123, in call
await self.middleware_stack(scope, receive, send)
File "/home/pai/envs/glm4/lib/python3.9/site-packages/starlette/middleware/errors.py", line 186, in call
raise exc
File "/home/pai/envs/glm4/lib/python3.9/site-packages/starlette/middleware/errors.py", line 164, in call
await self.app(scope, receive, _send)
File "/home/pai/envs/glm4/lib/python3.9/site-packages/starlette/middleware/cors.py", line 85, in call
await self.app(scope, receive, send)
File "/home/pai/envs/glm4/lib/python3.9/site-packages/starlette/middleware/exceptions.py", line 65, in call
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/home/pai/envs/glm4/lib/python3.9/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/home/pai/envs/glm4/lib/python3.9/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/home/pai/envs/glm4/lib/python3.9/site-packages/starlette/routing.py", line 756, in call
await self.middleware_stack(scope, receive, send)
File "/home/pai/envs/glm4/lib/python3.9/site-packages/starlette/routing.py", line 776, in app
await route.handle(scope, receive, send)
File "/home/pai/envs/glm4/lib/python3.9/site-packages/starlette/routing.py", line 297, in handle
await self.app(scope, receive, send)
File "/home/pai/envs/glm4/lib/python3.9/site-packages/starlette/routing.py", line 77, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/home/pai/envs/glm4/lib/python3.9/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/home/pai/envs/glm4/lib/python3.9/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/home/pai/envs/glm4/lib/python3.9/site-packages/starlette/routing.py", line 72, in app
response = await func(request)
File "/home/pai/envs/glm4/lib/python3.9/site-packages/fastapi/routing.py", line 278, in app
raw_response = await run_endpoint_function(
File "/home/pai/envs/glm4/lib/python3.9/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
return await dependant.call(**values)
File "/mnt/workspace/GLM-4/basic_demo/openai_api_server.py", line 338, in create_chat_completion
async for response in generate_stream_glm4(gen_params):
File "/mnt/workspace/GLM-4/basic_demo/openai_api_server.py", line 196, in generate_stream_glm4
async for output in engine.generate(inputs=inputs, sampling_params=sampling_params, request_id="glm-4"):
TypeError: generate() got an unexpected keyword argument 'inputs'

There is no such inputs parameter????

Expected behavior / 期待表现

How can this be solved?

Batch API for GLM-4V Error

System Info / 系統信息

pip install -U zhipuai
windows 11
python 3.10

Who can help? / 谁可以帮助到您?

No response

Information / 问题信息

  • The official example scripts / 官方的示例脚本
  • My own modified scripts / 我自己修改的脚本和任务

Reproduction / 复现过程

After uploading the jsonl file I get a batch id, but every time I check the results I find the images were never actually fetched.
The image URLs point to my local machine, exposed via an Apache tunnel over a 100 Mb Alibaba Cloud connection. If I send a single request to GLM-4V, the same URL returns labels.
Ruled out: image loading speed and URL resolution problems.

The same URLs placed in the jsonl are never fetched (my backend monitors the tunnel traffic), yet results sometimes come back directly, containing vague, generic hallucinations. If my machine is off and the images are unreachable, other photos can be substituted for testing. Sometimes 1,000 requests all finish with output almost instantly.

JSONL Sample:
{"custom_id": 100008, "method": "POST", "url": "/v4/chat/completions", "body": {"model": "glm-4v", "messages": [{"role": "system", "content": "假设你是一个服装图像标注机器人,给上传服装图片标注各个设计制作标签,越详细越好,但无法观察或不确定标签不用输出,减少幻觉。JSON格式返回。"}, {"role": "user", "content": [{"type": "text", "text": "给上传服装图片标注服装设计制作标签,越详细越好。JSON格式返回。"}, {"type": "image_url", "image_url": "http://garman.natapp1.cc/2024.03.17%2015.53.33_55d01ecd67bc652dc29793a9_65f6a17d00000000120204a3_EP雅莹真丝连衣裙合集_10.png"}]}], "max_tokens": 1000}}

Batch ID:
batch_1797884944774336512
batch_1797920530541981696
batch_179817156076740198

Expected behavior / 期待表现

Output the correct labels, or an error message.

First system prompt does not take effect (an empty system prompt needs to be added first).

System Info / 系統信息

vllm 0.4.3
cuda 12.2
model glm-4-9b-chat

Who can help? / 谁可以帮助到您?

No response

Information / 问题信息

  • The official example scripts / 官方的示例脚本
  • My own modified scripts / 我自己修改的脚本和任务

Reproduction / 复现过程

Non-working scenario

request

{
    "model": "glm-4",
    "messages": [
        {
            "role": "system",
            "content": "你需要扮演一个技术专家,名字叫测试助手,你擅长解答技术问题"
        },
        {
            "role": "user",
            "content": "你是谁,你擅长什么",
            "name": null,
            "function_call": null
        }
    ],
    "temperature": 0.8,
    "top_p": 0.8,
    "max_tokens": null,
    "stream": false,
    "repetition_penalty": 1.1
}

response

{
    "model": "glm-4",
    "id": "",
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "我是一个人工智能助手,名为 ChatGLM。我是在2023由清华大学 KEG 实验室和智谱 AI 公司共同训练的语言模型开发的人工智能程序。我的优势在于能够理解和回答关于广泛领域的问题,包括但不限于科学、技术、历史、文化等方面。同时,我也擅长处理日常对话,提供生活建议和信息查询等服务。",
                "name": null,
                "function_call": null
            },
            "finish_reason": "stop"
        }
    ],
    "created": 1717641052,
    "usage": {
        "prompt_tokens": 13,
        "total_tokens": 88,
        "completion_tokens": 75
    }
}

Working scenario

Append an empty system-role record to the messages list.

request

{
    "model": "glm-4",
    "messages": [
        {
            "role": "system",
            "content": ""
        },
        {
            "role": "system",
            "content": "你需要扮演一个技术专家,名字叫测试助手,你擅长解答技术问题"
        },
        {
            "role": "user",
            "content": "你是谁,你擅长什么",
            "name": null,
            "function_call": null
        }
    ],
    "temperature": 0.8,
    "top_p": 0.8,
    "max_tokens": null,
    "stream": false,
    "repetition_penalty": 1.1
}

response

{
    "model": "glm-4",
    "id": "",
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "您好!我是测试助手,一名专注于软件和信息技术领域的技术专家。我擅长的内容包括但不限于:\n\n1. **软件开发**:理解编程语言(如Python、Java、C++等)的原理和应用。\n2. **软件测试**:包括单元测试、集成测试、系统测试以及性能测试等。\n3. **自动化测试**:使用工具和技术来自动化测试流程,提高测试效率和覆盖率。\n4. **质量保证**:帮助确保软件产品满足既定的质量标准。\n5. **持续集成/持续部署(CI/CD)**:提供关于如何设置和维护CI/CD管道的建议。\n6. **版本控制**:例如Git的使用和管理。\n7. **数据库管理**:SQL查询优化、数据库设计原则等。\n8. **网络安全**:基本的安全概念和实践,比如加密、认证和授权。\n\n如果您有任何与技术相关的问题或需要解决方案,欢迎随时向我提问。我会尽我所能为您提供专业建议和信息。",
                "name": null,
                "function_call": null
            },
            "finish_reason": "stop"
        }
    ],
    "created": 1717641150,
    "usage": {
        "prompt_tokens": 31,
        "total_tokens": 231,
        "completion_tokens": 200
    }
}

Expected behavior / 期待表现

I believe the issue might be due to the tool-choice logic. There seems to be a mechanism that ignores the first system prompt if the user passes the tool_choice parameter. It appears that on the vLLM side, GLM-4 interpreted the first system prompt as the tool_choice parameter and did not check whether the request actually included a tool choice. This caused the first system prompt provided by the user to have no effect on the model's output.

The example data format in the "multi-turn dialogue format" section of finetune_demo/README.md is incorrect

System Info / 系統信息

Who can help? / 谁可以帮助到您?

No response

Information / 问题信息

  • The official example scripts / 官方的示例脚本
  • My own modified scripts / 我自己修改的脚本和任务

Reproduction / 复现过程

[
  {
    "messages": [
      {
        "role": "system",
        "content": "<system prompt text>",
        "tools": [
          {
            "name": "<tool name>",
            "args": {
              "<arg name>": "<arg value>"
            }
          },
          // tools not closed here
          {
            "role": "user",
            "content": "<user prompt text>"
          },
          {
            "role": "assistant",
            "content": "<assistant response text>"
          },
// ...

The tools array is not closed with a bracket here.

Expected behavior / 期待表现

[
  {
    "messages": [
      {
        "role": "system",
        "content": "<system prompt text>",
        "tools": [
          {
            "name": "<tool name>",
            "args": {
              "<arg name>": "<arg value>"
            }
          }
        ]
      },
      {
        "role": "user",
        "content": "<user prompt text>"
      },
      {
        "role": "assistant",
        "content": "<assistant response text>"
      },
// ...

Problems with the search and long-document features

System Info / 系統信息

win11,cuda12.4,torch2.3.1+cu121

Who can help? / 谁可以帮助到您?

No response

Information / 问题信息

  • The official example scripts / 官方的示例脚本
  • My own modified scripts / 我自己修改的脚本和任务

Reproduction / 复现过程

Error messages from web search:

Server side:
D:\LLM\GLM-4>cd browser

D:\LLM\GLM-4\browser>pnpm start

[email protected] start D:\LLM\GLM-4\browser
npx ts-node src/server.ts

LOG_LEVEL debug
info: ⚡️[server]: Server is running at http://localhost:3000
info: session_id: 031d7fef-4a6c-4e86-83cc-703b10586214
info: action: search("GLM-4-9B") <|observation|>
info: Action line: search("GLM-4-9B") <|observation|>
debug: SimpleBrowser action search GLM-4-9B
debug: Searching for: GLM-4-9B
error: parse error: TypeError: Cannot read properties of undefined (reading 'value')
info: session_id: 031d7fef-4a6c-4e86-83cc-703b10586214
info: action: search("GLM-4-9B model") <|observation|>
info: Action line: search("GLM-4-9B model") <|observation|>
debug: SimpleBrowser action search GLM-4-9B model
debug: Searching for: GLM-4-9B model
error: parse error: TypeError: Cannot read properties of undefined (reading 'value')

Model side:
===BROWSER_RESPONSE===
{'contentType': 'system_error',
'metadata': {'failedCommand': 'search("GLM-4-9B") <|observation|>'},
'result': 'Error when executing command search("GLM-4-9B") <|observation|>\n'
'Error: Network or server error occurred',
'roleMetadata': 'system_error'}
===BROWSER_RESPONSE===
{'contentType': 'system_error',
'metadata': {'failedCommand': 'search("GLM-4 9B") <|observation|>'},
'result': 'Error when executing command search("GLM-4 9B") <|observation|>\n'
'Error: Network or server error occurred',
'roleMetadata': 'system_error'}
===BROWSER_RESPONSE===
{'contentType': 'system_error',
'metadata': {'failedCommand': 'search("GLM-4-9B") <|observation|>'},
'result': 'Error when executing command search("GLM-4-9B") <|observation|>\n'
'Error: Network or server error occurred',
'roleMetadata': 'system_error'}
===BROWSER_RESPONSE===
{'contentType': 'system_error',
'metadata': {'failedCommand': 'search("GLM-4-9B model") <|observation|>'},
'result': 'Error when executing command search("GLM-4-9B model") '
'<|observation|>\n'
'Error: Network or server error occurred',
'roleMetadata': 'system_error'}

After uploading a file, an error is raised before any further action; the messages are as follows:
Traceback (most recent call last):
File "D:\LLM\GLM-4\runtime\lib\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 600, in _run_script
exec(code, module.dict)
File "D:\LLM\GLM-4\src\main.py", line 162, in
with open(file_path, "wb") as f:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp\f9113ffe-aa83-42e3-beac-f3359cdc1c15.py'

The corresponding code is:
if page == Mode.LONG_CTX:
    if first_round:
        uploaded_files = st.file_uploader(
            "上传文件",
            type=["pdf", "txt", "py", "docx", "pptx", "json", "cpp", "md"],
            accept_multiple_files=True,
        )
        if uploaded_files and not st.session_state.files_uploaded:
            uploaded_texts = []
            for uploaded_file in uploaded_files:
                file_name: str = uploaded_file.name
                random_file_name = str(uuid4())
                file_extension = os.path.splitext(file_name)[1]
                file_path = os.path.join("/tmp", random_file_name + file_extension)
                with open(file_path, "wb") as f:
                    f.write(uploaded_file.getbuffer())
                if file_name.endswith(".pdf"):
                    content = extract_pdf(file_path)
                elif file_name.endswith(".docx"):
                    content = extract_docx(file_path)
                elif file_name.endswith(".pptx"):
                    content = extract_pptx(file_path)
                else:
                    content = extract_text(file_path)
                uploaded_texts.append(
                    FILE_TEMPLATE.format(file_name=file_name, file_content=content)
                )
                os.remove(file_path)
            st.session_state.uploaded_texts = "\n\n".join(uploaded_texts)
            st.session_state.uploaded_file_nums = len(uploaded_files)
        else:
            st.session_state.uploaded_texts = ""
            st.session_state.uploaded_file_nums = 0
elif page == Mode.VLM:
    if first_round:
        uploaded_image = st.file_uploader(
            "上传图片",
            type=["png", "jpg", "jpeg", "bmp", "tiff", "webp"],
            accept_multiple_files=False,
        )
        if uploaded_image:
            data: bytes = uploaded_image.read()
            image = Image.open(BytesIO(data)).convert("RGB")
            st.session_state.uploaded_image = image
        else:
            st.session_state.uploaded_image = None
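The FileNotFoundError above comes from the hard-coded "/tmp" prefix: there is no /tmp directory on Windows, so os.path.join produces the mixed path '/tmp\f911...' that cannot be opened. A minimal fix sketch (not the repository's official patch) is to derive the directory from tempfile instead:

import os
import tempfile
from uuid import uuid4

# Use the platform's temp directory instead of the hard-coded "/tmp",
# so the same code also works on Windows.
file_path = os.path.join(tempfile.gettempdir(), str(uuid4()) + file_extension)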

Expected behavior / 期待表现

The demo should run normally.

requirements installation fails

System Info / 系統信息

python 3.11.8
cuda 12.1
OS win10

Who can help? / 谁可以帮助到您?

No response

Information / 问题信息

  • The official example scripts / 官方的示例脚本
  • My own modified scripts / 我自己修改的脚本和任务

Reproduction / 复现过程

(venv) D:\GLM-4>python -m pip install -r D:\GLM-4\basic_demo\requirements.txt
Collecting torch>=2.3.0 (from -r D:\GLM-4\basic_demo\requirements.txt (line 1))
Using cached torch-2.3.1-cp311-cp311-win_amd64.whl.metadata (26 kB)
Collecting torchvision>=0.18.0 (from -r D:\GLM-4\basic_demo\requirements.txt (line 2))
Using cached torchvision-0.18.1-cp311-cp311-win_amd64.whl.metadata (6.6 kB)
Collecting transformers==4.40.0 (from -r D:\GLM-4\basic_demo\requirements.txt (line 3))
Using cached transformers-4.40.0-py3-none-any.whl.metadata (137 kB)
Collecting huggingface-hub>=0.23.1 (from -r D:\GLM-4\basic_demo\requirements.txt (line 4))
Using cached huggingface_hub-0.23.3-py3-none-any.whl.metadata (12 kB)
Collecting sentencepiece>=0.2.0 (from -r D:\GLM-4\basic_demo\requirements.txt (line 5))
Using cached sentencepiece-0.2.0-cp311-cp311-win_amd64.whl.metadata (8.3 kB)
Collecting pydantic>=2.7.1 (from -r D:\GLM-4\basic_demo\requirements.txt (line 6))
Using cached pydantic-2.7.3-py3-none-any.whl.metadata (108 kB)
Collecting timm>=0.9.16 (from -r D:\GLM-4\basic_demo\requirements.txt (line 7))
Using cached timm-1.0.3-py3-none-any.whl.metadata (43 kB)
Collecting tiktoken>=0.7.0 (from -r D:\GLM-4\basic_demo\requirements.txt (line 8))
Using cached tiktoken-0.7.0-cp311-cp311-win_amd64.whl.metadata (6.8 kB)
Collecting accelerate>=0.30.1 (from -r D:\GLM-4\basic_demo\requirements.txt (line 9))
Using cached accelerate-0.30.1-py3-none-any.whl.metadata (18 kB)
Collecting sentence_transformers>=2.7.0 (from -r D:\GLM-4\basic_demo\requirements.txt (line 10))
Using cached sentence_transformers-3.0.0-py3-none-any.whl.metadata (10 kB)
Collecting vllm>=0.4.3 (from -r D:\GLM-4\basic_demo\requirements.txt (line 11))
Using cached vllm-0.4.3.tar.gz (693 kB)
Installing build dependencies ... done
Getting requirements to build wheel ... error
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [18 lines of output]
Traceback (most recent call last):
  File "D:\GLM-4\venv\Lib\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 353, in <module>
    main()
  File "D:\GLM-4\venv\Lib\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 335, in main
    json_out['return_val'] = hook(**hook_input['kwargs'])
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\GLM-4\venv\Lib\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 118, in get_requires_for_build_wheel
    return hook(config_settings)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Win11\AppData\Local\Temp\pip-build-env-8ehg80mm\overlay\Lib\site-packages\setuptools\build_meta.py", line 325, in get_requires_for_build_wheel
    return self._get_build_requires(config_settings, requirements=['wheel'])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Win11\AppData\Local\Temp\pip-build-env-8ehg80mm\overlay\Lib\site-packages\setuptools\build_meta.py", line 295, in _get_build_requires
    self.run_setup()
  File "C:\Users\Win11\AppData\Local\Temp\pip-build-env-8ehg80mm\overlay\Lib\site-packages\setuptools\build_meta.py", line 311, in run_setup
    exec(code, locals())
  File "<string>", line 36, in <module>
AssertionError: vLLM only supports Linux platform (including WSL).
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
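The AssertionError is raised by vLLM's own setup script: vLLM only builds on Linux (including WSL), so this install is expected to fail on native Windows. If the vLLM-based demos are not needed, one hedged workaround is to install everything except the vllm line, for example with a small helper script (the source path follows the log above; the output file name is an assumption):

# filter_requirements.py - write a copy of requirements.txt without the vllm entry,
# then install it with: pip install -r requirements_no_vllm.txt
src = r"D:\GLM-4\basic_demo\requirements.txt"
dst = r"D:\GLM-4\basic_demo\requirements_no_vllm.txt"

with open(src, encoding="utf-8") as f_in, open(dst, "w", encoding="utf-8") as f_out:
    for line in f_in:
        if not line.strip().lower().startswith("vllm"):
            f_out.write(line)

Alternatively, running the installation inside WSL keeps the vLLM demos usable.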

Expected behavior / 期待表现

The requirements should install successfully; thanks.

GLM-4-9B-Chat-1M fails to start

System Info / 系統信息

win10
cuda 11.8
python 3.12
transformers 4.40
GPU A4000

Who can help? / 谁可以帮助到您?

Anyone who can help.

Information / 问题信息

  • The official example scripts / 官方的示例脚本
  • My own modified scripts / 我自己修改的脚本和任务

Reproduction / 复现过程

After launching the composite demo, the page appears, but clicking All Tools or any button in the text-analysis mode raises an error:

KeyError: '<|endoftext|>'
Traceback:
File "C:\Users\XXX\anaconda3\envs\GLM4\lib\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 575, in _run_script
self._session_state.on_script_will_rerun(
File "C:\Users\XXX\anaconda3\envs\GLM4\lib\site-packages\streamlit\runtime\state\safe_session_state.py", line 65, in on_script_will_rerun
self._state.on_script_will_rerun(latest_widget_states)
File "C:\Users\XXX\anaconda3\envs\GLM4\lib\site-packages\streamlit\runtime\state\session_state.py", line 517, in on_script_will_rerun
self._call_callbacks()
File "C:\Users\XXX\anaconda3\envs\GLM4\lib\site-packages\streamlit\runtime\state\session_state.py", line 530, in _call_callbacks
self._new_widget_state.call_callback(wid)
File "C:\Users\XXX\anaconda3\envs\GLM4\lib\site-packages\streamlit\runtime\state\session_state.py", line 274, in call_callback
callback(*args, **kwargs)
File "C:\Users\XXX\glm4\GLM-4\composite_demo\src\main.py", line 123, in page_changed
st.session_state.client = build_client(Mode(new_page))
File "C:\Users\XXX\glm4\GLM-4\composite_demo\src\main.py", line 107, in build_client
return get_client(CHAT_MODEL_PATH, typ)
File "C:\Users\XXX\anaconda3\envs\GLM4\lib\site-packages\streamlit\runtime\caching\cache_utils.py", line 165, in wrapper
return cached_func(*args, **kwargs)
File "C:\Users\XXX\anaconda3\envs\GLM4\lib\site-packages\streamlit\runtime\caching\cache_utils.py", line 194, in call
return self._get_or_create_cached_value(args, kwargs)
File "C:\Users\XXX\anaconda3\envs\GLM4\lib\site-packages\streamlit\runtime\caching\cache_utils.py", line 221, in _get_or_create_cached_value
return self._handle_cache_miss(cache, value_key, func_args, func_kwargs)
File "C:\Users\XXX\anaconda3\envs\GLM4\lib\site-packages\streamlit\runtime\caching\cache_utils.py", line 277, in _handle_cache_miss
computed_value = self._info.func(*func_args, **func_kwargs)
File "C:\Users\XXX\glm4\GLM-4\composite_demo\src\client.py", line 89, in get_client
return HFClient(model_path)
File "C:\Users\XXX\glm4\GLM-4\composite_demo\src\clients\hf.py", line 18, in init
self.tokenizer = AutoTokenizer.from_pretrained(
File "C:\Users\XXX\anaconda3\envs\GLM4\lib\site-packages\transformers\models\auto\tokenization_auto.py", line 678, in from_pretrained
return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
File "C:\Users\XXX\anaconda3\envs\GLM4\lib\site-packages\transformers\tokenization_utils_base.py", line 1825, in from_pretrained
return cls._from_pretrained(
File "C:\Users\XXX\anaconda3\envs\GLM4\lib\site-packages\transformers\tokenization_utils_base.py", line 2061, in _from_pretrained
added_tokens = tokenizer.sanitize_special_tokens()
File "C:\Users\XXX\anaconda3\envs\GLM4\lib\site-packages\transformers\tokenization_utils_base.py", line 856, in sanitize_special_tokens
return self.add_tokens(self.all_special_tokens_extended, special_tokens=True)
File "C:\Users\XXX\anaconda3\envs\GLM4\lib\site-packages\transformers\tokenization_utils_base.py", line 999, in add_tokens
return self._add_tokens(new_tokens, special_tokens=special_tokens)
File "C:\Users\XXX\anaconda3\envs\GLM4\lib\site-packages\transformers\tokenization_utils.py", line 421, in _add_tokens
and self.convert_tokens_to_ids(token) == self.convert_tokens_to_ids(self.unk_token)
File "C:\Users\XXX\anaconda3\envs\GLM4\lib\site-packages\transformers\tokenization_utils.py", line 575, in convert_tokens_to_ids
return self._convert_token_to_id_with_added_voc(tokens)
File "C:\Users\XXX\anaconda3\envs\GLM4\lib\site-packages\transformers\tokenization_utils.py", line 588, in _convert_token_to_id_with_added_voc
return self._convert_token_to_id(token)
File "C:\Users\XXX/.cache\huggingface\modules\transformers_modules\glm-4-9b-chat-1m\tokenization_chatglm.py", line 95, in _convert_token_to_id
return self.mergeable_ranks[token]
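The KeyError is raised inside the model repo's custom tokenization_chatglm.py when '<|endoftext|>' is looked up in the merge table; a commonly reported cause is a mismatch between the installed transformers version and the model repo's configuration/tokenizer files. A minimal sketch to reproduce the problem outside Streamlit, assuming a hypothetical local path to the GLM-4-9B-Chat-1M weights:

from transformers import AutoTokenizer

# Hypothetical local path to the downloaded GLM-4-9B-Chat-1M weights.
MODEL_PATH = "D:/models/glm-4-9b-chat-1m"

# trust_remote_code is needed because the tokenizer lives in the model repo
# (tokenization_chatglm.py). If this already raises the same KeyError, the
# problem is in the environment/model files, not in the composite demo.
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
print(tokenizer.convert_tokens_to_ids("<|endoftext|>"))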

Expected behavior / 期待表现

The demo should start normally.

QLoRA training issue with GLM-4

Feature request / 功能建议

I ran ablation experiments on ChatGLM3-6B and GLM-4-9B with my own training script and noticed something strange.
ChatGLM3 training parameters: QLoRA (4-bit), rank=256, target_module="query_key_value", max_token_len=32768, bs=1, gradient_checkpointing=True, bf16.
This trains normally and uses only about 50 GB of GPU memory.
With GLM-4-9B, however,
the training parameters are: QLoRA (4-bit), rank=16, target_module="query_key_value", max_token_len=8000, bs=1, gradient_checkpointing=True, bf16,
and training keeps running out of GPU memory (OOM), even though both lora_rank and max_token_len are drastically smaller. Can the difference between 6B and 9B parameters alone make training impossible?

It is worth noting that with the same GLM-4 settings I can train Qwen-32B-Chat, so a 9B model should not fail to train.

I also found that ChatGLM3-6B trains and infers much faster than other 7B models (e.g., Mistral-7B) and uses far less GPU memory than they do. I am curious why that is.
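For reference, a minimal sketch of the GLM-4-9B QLoRA configuration described above, using transformers + peft + bitsandbytes. The model id, lora_alpha, dropout, and the call to enable_input_require_grads are assumptions, not the exact script used here:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # QLoRA: 4-bit base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # bf16 compute
)

model = AutoModelForCausalLM.from_pretrained(
    "THUDM/glm-4-9b-chat",                  # assumed model id
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
model.gradient_checkpointing_enable()
model.enable_input_require_grads()          # needed with gradient checkpointing + PEFT

lora_config = LoraConfig(
    r=16,                                   # the rank that still OOMs above
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query_key_value"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()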

Motivation / 动机

Your contribution / 您的贡献

OCR performance

Thanks to the authors for open-sourcing the model!

While using GLM-4V-9B I found its OCR performance to be excellent. Will a related technical report be published? Could you share roughly how much OCR data was used, and whether it came from public or proprietary sources?

Can openai_api_server.py be upgraded to an API compatible with GPT-4's tool-calling (tools/tool_calls)?

System Info / 系統信息

This is an API compatibility issue, unrelated to hardware.

Who can help? / 谁可以帮助到您?

No response

Information / 问题信息

  • The official example scripts / 官方的示例脚本
  • My own modified scripts / 我自己修改的脚本和任务

Reproduction / 复现过程

  1. Run the basic_demo/openai_api_server.py script
  2. Call the /v1/chat/completions endpoint with a tool-calling payload:
    { "messages": [ { "content": "What's the weather like in San Francisco", "role": "user" } ], "model": "glm4", "stream": false, "temperature": 0.8, "tools": [ { "type": "function", "function": { "description": "根据传入的城市获取天气信息", "name": "getCurrentWeather", "parameters": { "type": "object", "properties": { "location": { "type": "string", "description": "城市名称,如:武汉" } }, "required": [ "location" ] } } } ], "tool_choice": "auto" }
  3. The response comes back in the GPT-3-style function_call format:
    { "model": "glm4", "id": "", "object": "chat.completion", "choices": [ { "index": 0, "message": { "role": "assistant", "content": "getCurrentWeather\n{\"location\": \"San Francisco\"}", "name": null, "function_call": { "name": "getCurrentWeather", "arguments": "{\"location\": \"San Francisco\"}" } }, "finish_reason": "function_call" } ], "created": 1717641432, "usage": { "prompt_tokens": 157, "total_tokens": 168, "completion_tokens": 11 } }

Expected behavior / 期待表现

The response should be compatible with GPT-4's current tool_calls format, e.g.:
{ "model": "glm4", "id": "", "object": "chat.completion", "choices": [ { "index": 0, "message": { "role": "assistant", "content": "getCurrentWeather\n ```python\ntool_call(location='San Francisco')\n```", "tool_calls": [ { "id": "call0", "type": "function", "function": { "name": "getCurrentWeather", "arguments": "{\"location\": \"San Francisco\"}" } } ] }, "finish_reason": "tool_calls" } ], "created": 1717635659, "usage": { "prompt_tokens": 162, "total_tokens": 183, "completion_tokens": 21 } }

No results are returned when the service started by basic_demo/openai_api_server.py is called through the streaming interface

System Info / 系統信息

Non-streaming requests do return results, so this should not be a hardware issue.

Who can help? / 谁可以帮助到您?

No response

Information / 问题信息

  • The official example scripts / 官方的示例脚本
  • My own modified scripts / 我自己修改的脚本和任务

Reproduction / 复现过程

  1. Run the basic_demo/openai_api_server.py script
  2. Call the /v1/chat/completions endpoint with the stream parameter set to true
  3. No result is returned
(screenshot) When stream is false, a result is returned (screenshot). A minimal streaming-client sketch follows below.
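A minimal streaming-client sketch for reproducing this, assuming the openai Python SDK and that the server listens on port 8000 (both assumptions):

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="glm4",
    messages=[{"role": "user", "content": "你好"}],
    stream=True,
)

# Each chunk should carry an incremental delta; if the server never yields
# chunks with content, this loop prints nothing.
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
print()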

Expected behavior / 期待表现

Streaming responses should be returned normally.

model(**batch) raises an error

System Info / 系統信息

deepspeed==0.14.2
torch==2.3.0
transformers==4.41.2

Who can help? / 谁可以帮助到您?

No response

Information / 问题信息

  • The official example scripts / 官方的示例脚本
  • My own modified scripts / 我自己修改的脚本和任务

Reproduction / 复现过程

Calling model(**batch) raises an error:
batch:

{'input_ids': tensor([[151331, 151333, 151331,  ...,      0,      0,      0],                                                                                                   
        [151331, 151333, 151331,  ...,      0,      0,      0],                                                                                                                 
        [151331, 151333, 151331,  ...,      0,      0,      0],
        ...,
        [151331, 151333, 151331,  ...,      0,      0,      0],
        [151331, 151333, 151331,  ...,      0,      0,      0],
        [151331, 151333, 151331,  ..., 114689,  99153, 151329]]), 'labels': tensor([[151331, 151333, 151331,  ...,      0,      0,      0],
        [151331, 151333, 151331,  ...,      0,      0,      0],
        [151331, 151333, 151331,  ...,      0,      0,      0],
        ...,
        [151331, 151333, 151331,  ...,      0,      0,      0],
        [151331, 151333, 151331,  ...,      0,      0,      0],
        [151331, 151333, 151331,  ..., 114689,  99153, 151329]]), 'attention_mask': tensor([[ True,  True,  True,  ..., False, False, False],
        [ True,  True,  True,  ..., False, False, False],
        [ True,  True,  True,  ..., False, False, False],
        ...,
        [ True,  True,  True,  ..., False, False, False],
        [ True,  True,  True,  ..., False, False, False],
        [ True,  True,  True,  ...,  True,  True, False]])}

Error message:

Traceback (most recent call last):                                                                                                                                              
  File "<console>", line 1, in <module>                                                                                                                                         
  File "/root/anaconda3/envs/pytorch_build/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl                                            
    return self._call_impl(*args, **kwargs)                                             
  File "/root/anaconda3/envs/pytorch_build/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl                                                    
    return forward_call(*args, **kwargs)                                                                                                                                        
  File "/root/anaconda3/envs/pytorch_build/lib/python3.9/site-packages/accelerate/hooks.py", line 166, in new_forward           
    output = module._old_forward(*args, **kwargs)                                                                                                                               
  File "/root/.cache/huggingface/modules/transformers_modules/glm-4-9b-chat/modeling_chatglm.py", line 881, in forward                            
    transformer_outputs = self.transformer(                                                                                                                                     
  File "/root/anaconda3/envs/pytorch_build/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl                                            
    return self._call_impl(*args, **kwargs)                                                                                                                                     
  File "/root/anaconda3/envs/pytorch_build/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl                                                    
    return forward_call(*args, **kwargs)                                                                                                                                        
  File "/root/.cache/huggingface/modules/transformers_modules/glm-4-9b-chat/modeling_chatglm.py", line 767, in forward                                                          
    full_attention_mask = self.get_masks(input_ids, past_key_values, padding_mask=attention_mask)                                                                               
  File "/root/.cache/huggingface/modules/transformers_modules/glm-4-9b-chat/modeling_chatglm.py", line 668, in get_masks                    
    full_attention_mask -= padding_mask.unsqueeze(-1) - 1                                                                                                                       
RuntimeError: Subtraction, the `-` operator, with a bool tensor is not supported. If you are trying to invert a mask, use the `~` or `logical_not()` operator instead.       
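The subtraction happens inside the model repo's get_masks(), which does integer arithmetic on the padding mask, so passing a bool attention_mask triggers the error. A minimal workaround sketch (assuming the rest of the batch is built as shown above) is to cast the mask to an integer dtype before the forward call:

# Cast the bool padding mask to int64 so modeling_chatglm.py can do
# `full_attention_mask -= padding_mask.unsqueeze(-1) - 1` without error.
batch["attention_mask"] = batch["attention_mask"].long()
outputs = model(**batch)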

Expected behavior / 期待表现

The forward pass should run without raising an error.

Streaming output returns an empty result

System Info / 系統信息

Same as requirements.txt.

Who can help? / 谁可以帮助到您?

No response

Information / 问题信息

  • The official example scripts / 官方的示例脚本
  • My own modified scripts / 我自己修改的脚本和任务

Reproduction / 复现过程

  1. Modify the main() function in basic_demo/openai_api_request.py to call simple_chat(True)
  2. Start the server: python server.py
  3. Run the client: python request.py

Expected behavior / 期待表现

Streaming and non-streaming output should be consistent.
