01-ai / Yi-1.5
Yi-1.5 is an upgraded version of Yi, delivering stronger performance in coding, math, reasoning, and instruction-following capability.
License: Apache License 2.0
The current tokenizers all differ from the previous ones (ids 3-13 are missing from the vocab, and many added_tokens were introduced). Is there a particular reason for this?
For example:
https://huggingface.co/01-ai/Yi-1.5-34B-Chat/blob/main/tokenizer.json
https://huggingface.co/01-ai/Yi-1.5-34B-32K/blob/main/tokenizer.json
Could the missing tokens be added back to the vocab?
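For reference, gaps in a tokenizer's id range can be found mechanically. The helper below is a sketch with a toy vocab standing in for the real one; with transformers installed, the real mapping would come from `AutoTokenizer.from_pretrained("01-ai/Yi-1.5-34B-Chat").get_vocab()`.

```python
# Hypothetical helper: find ids in [0, size) that no vocab entry maps to.
def missing_ids(vocab, size):
    present = set(vocab.values())
    return [i for i in range(size) if i not in present]

# Toy vocab for illustration only; ids 3-13 are absent, as reported above.
toy_vocab = {"<unk>": 0, "<|startoftext|>": 1, "<|endoftext|>": 2,
             "a": 14, "b": 15}
print(missing_ids(toy_vocab, 16))  # [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
```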
For individual users, platforms like POE are very convenient. yi-large has scored well on the Arena leaderboard (congratulations). It would be great to see it on POE so more people can try it out.
Hello, and thank you for your excellent work. Is there any required input format for fine-tuning Yi-1.5-34B?
I noticed some special tags in the Yi-01 fine-tuning demo data (https://github.com/01-ai/Yi/blob/main/finetune/yi_example_dataset/data/train.jsonl).
To fine-tune Yi-1.5 well, should my data follow the format used in the Yi-01 demo?
Thanks in advance.
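Not an official answer, but the <|im_start|>/<|im_end|> markers discussed in other issues suggest Yi-1.5-Chat uses a ChatML-style layout, so one plausible way to render a training sample is sketched below (the helper name is hypothetical):

```python
# Hypothetical formatter: render chat messages in ChatML style, the layout
# that the <|im_start|>/<|im_end|> special tokens point to.
def to_chatml(messages):
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    )

sample = [{"role": "user", "content": "你是谁?"},
          {"role": "assistant", "content": "我是 Yi。"}]
print(to_chatml(sample))
```

The safer reference is `tokenizer.apply_chat_template(sample, tokenize=False)`, which emits the template actually shipped with the model.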
Why does downloading the model files with the following script raise an error? I have downloaded other models with the same command without any problems.

```python
import torch
from modelscope import snapshot_download, AutoModel, AutoTokenizer
from modelscope import GenerationConfig

model_dir = snapshot_download('01-ai/Yi-1.5-34B-Chat', cache_dir='/public/home/team4/zerooneai', revision='master')
```
Thank you for releasing such strong models. Could you also publish AWQ or GPTQ-INT4 quantized versions?
Other problems, such as use_fast producing different outputs and add_bos being enabled by default in the tokenizer config, have also been reported in other issues.
Hi, while running the simple Quick Start example (https://github.com/01-ai/Yi-1.5?tab=readme-ov-file#quick-start), I found that generation in Chinese never stops. For example, asking "你是谁?" ("Who are you?") yields:

```
你好!我是Yi,一个由零一万物自主研发的大规模语言模型。我可以回答问题、提供信息、讨论话题、创作文章等等,无论涉及任何领域,我都会尽力为你提供帮助。如果你有任何疑问或需要帮助,随时可以问我!请问有什么我可以为你服务的?回来,我这里有一个新的回答:
我是零一万物的人工智能助手,被设计来帮助用户解答问题、提供信息和支持。你可以问我关于科学、技术、历史、文化等各种话题。如果你有任何问题,请随时提问。
请问你认为人工智能在未来....
```

(The model introduces itself, then instead of stopping continues with "回来,我这里有一个新的回答:", roughly "back again, here is a new answer:", gives a second self-introduction, and starts asking its own questions.)

Additional details:
max_new_tokens = 128
messages = [ {"role": "user", "content": "你是谁?"} ]
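The usual fix for runaway generation is to make sure generate() stops on the chat end-of-turn token, e.g. passing eos_token_id=tokenizer.convert_tokens_to_ids("<|im_end|>") (an assumption about the intended stop token, not a confirmed fix). As a client-side workaround, decoded text can also be cut at the first end-of-turn marker; a minimal sketch:

```python
# Hypothetical workaround: truncate decoded text at the first end-of-turn
# marker, in case the model keeps generating past its answer.
def truncate_at_eot(text, eot="<|im_end|>"):
    idx = text.find(eot)
    return text if idx == -1 else text[:idx]

raw = "我是零一万物的人工智能助手。<|im_end|>回来,我这里有一个新的回答:..."
print(truncate_at_eot(raw))  # 我是零一万物的人工智能助手。
```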
test github issue feeding
What is the prompt template for Ollama?
For the base model (not a chat model), could you provide a more appropriate demo than the current Quick Start code?
Are there plans to release a 200K-context model later? Looking forward to it!
Yi-1.5's self-description reads: "Compared with Yi, Yi-1.5 delivers stronger performance in coding, math, reasoning, and instruction-following capability, while still maintaining excellent capabilities in language understanding, commonsense reasoning, and reading comprehension."
In the vast majority of scenarios, coding and math abilities are not needed, and models like GPT already do well there.
From my perspective as an application developer, what I would really like is a model with very strong instruction-following that is also energy-efficient.
Hi Yi developers, the Yi-1.5-9B tokenizer generates an unexpected space token when tokenizing "<|im_end|>\n" with the fast tokenizer under older transformers versions, while it behaves normally with transformers 4.42.4 or with the slow tokenizer.
What is the correct way to tokenize "<|im_end|>\n"?
How was it tokenized during the SFT stage?
With an older transformers version and the fast tokenizer:

```python
>>> tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-9B-Chat")
>>> tokenizer("<|im_end|>")
{'input_ids': [7], 'attention_mask': [1]}
>>> tokenizer("<|im_end|>\n")
{'input_ids': [7, 59568, 144], 'attention_mask': [1, 1, 1]}
```

In this case there is an unexpected token 59568, which corresponds to a space.

With transformers 4.42.4, or with use_fast=False, the space token disappears:

```python
>>> tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-9B-Chat")
>>> tokenizer("<|im_end|>\n")
{'input_ids': [7, 144], 'attention_mask': [1, 1]}
>>> tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-9B-Chat", use_fast=False)
>>> tokenizer("<|im_end|>\n")
{'input_ids': [7, 144], 'attention_mask': [1, 1]}
```
I evaluated Yi-1.5-9B with opencompass on the MATH (4-shot), HumanEval / HumanEval Plus (0-shot), and MBPP (3-shot) test sets. My results differ noticeably from the officially reported numbers. Could you share the official evaluation script or the detailed parameters so the numbers can be reproduced?
Here are my evaluation script and results:

```shell
cd opencompass
python run.py --datasets math_gen humaneval_gen humaneval_plus_gen mbpp_gen \
    --hf-path /root/models/Yi-1.5-9B \
    --model-kwargs device_map='auto' \
    --tokenizer-kwargs padding_side='left' truncation='left' use_fast=False \
    --max-out-len 512 --max-seq-len 4096 --batch-size 8 --no-batch-padding --num-gpus 1
```
| dataset | version | metric | mode | HuggingFace_models_Yi-1.5-9B |
| --- | --- | --- | --- | --- |
| math | 5f997e | accuracy | gen | 28.3 |
| openai_humaneval | 8e312c | humaneval_pass@1 | gen | 25.61 |
| humaneval_plus | 8e312c | humaneval_plus_pass@1 | gen | 21.34 |
| mbpp | 3ede66 | score | gen | 58.6 |
| mbpp | 3ede66 | pass | gen | 293 |
| mbpp | 3ede66 | timeout | gen | 4 |
| mbpp | 3ede66 | failed | gen | 24 |
| mbpp | 3ede66 | wrong_answer | gen | 179 |
I hope future versions can strengthen these capabilities in the smaller-parameter models.
May I know if AGIEval uses a few-shot or zero-shot setting, and how should I reproduce this result?
I tested with the following code:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-9B-Chat")
print("token of <|im_start|>: " + str(tokenizer.encode("<|im_start|>")))
print("token of <|im_end|>: " + str(tokenizer.encode("<|im_end|>")))
```

The result is strange:

```
token of <|im_start|>: [1581, 59705, 622, 59593, 5858, 46826]
token of <|im_end|>: [7]
```

In theory, the output for <|im_start|> should be [6].
I am not sure whether this is a tokenizer problem, so I opened PRs upstream:
https://huggingface.co/01-ai/Yi-1.5-9B-Chat/discussions/12
https://huggingface.co/01-ai/Yi-1.5-9B-Chat/discussions/13
Please take a look at whether something is wrong here. Thanks.
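One possible explanation (my assumption, not confirmed by the maintainers): <|im_start|> is not registered as an added/special token in this tokenizer, so it gets segmented like ordinary text, while <|im_end|> is registered and maps to the single id 7. A toy model of that behavior:

```python
# Toy encoder (NOT the real Yi tokenizer): marker strings registered in
# `specials` come back as one id; anything else falls back to piecewise
# encoding, shown here at the character level for simplicity.
def encode(text, specials, fallback_id=0):
    if text in specials:
        return [specials[text]]
    return [fallback_id] * len(text)  # stands in for subword splitting

specials = {"<|im_end|>": 7}  # id taken from the output above
print(encode("<|im_end|>", specials))         # [7]
print(len(encode("<|im_start|>", specials)))  # split into many pieces
```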
@richardllin @panyx0718 @Imccccc Hi all, could you please give some advice for this issue?
Does the Yi-1.5-Chat model use the standard ChatML template? Is the bos_token <|im_start|> or <|startoftext|>? Is the eos_token <|im_end|> or <|endoftext|>?
Yi-1.5-34B-Chat-16K/config.json is not consistent with Yi-1.5-34B-Chat-16K/tokenizer_config.json.
During generation or training, is the bos_token added to the front of the prompt?
As shown in Yi-1.5-34B-Chat-16K/config.json:

```json
"bos_token_id": 1,
"eos_token_id": 2,
```

As shown in Yi-1.5-34B-Chat-16K/tokenizer_config.json:

```json
"bos_token": "<|startoftext|>",
"eos_token": "<|im_end|>",
```

and, in the same file, the added token entries:

```json
"1": {
  "content": "<|startoftext|>",
  "lstrip": false,
  "normalized": false,
  "rstrip": false,
  "single_word": false,
  "special": true
},
"2": {
  "content": "<|endoftext|>",
  "lstrip": false,
  "normalized": false,
  "rstrip": false,
  "single_word": false,
  "special": true
},
"7": {
  "content": "<|im_end|>",
  "lstrip": false,
  "normalized": false,
  "rstrip": false,
  "single_word": false,
  "special": true
}
```
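The mismatch can be checked mechanically. In the sketch below the values are hard-coded from the two files quoted above (in practice they would be read with json.load); the check shows that the eos_token resolves to id 7 while config.json declares eos_token_id 2:

```python
# Values copied from the config.json / tokenizer_config.json excerpts above.
config = {"bos_token_id": 1, "eos_token_id": 2}
added_tokens = {1: "<|startoftext|>", 2: "<|endoftext|>", 7: "<|im_end|>"}
tokenizer_config = {"bos_token": "<|startoftext|>", "eos_token": "<|im_end|>"}

# Invert the id -> token table, then compare against config.json.
token_to_id = {tok: tid for tid, tok in added_tokens.items()}
bos_ok = token_to_id[tokenizer_config["bos_token"]] == config["bos_token_id"]
eos_ok = token_to_id[tokenizer_config["eos_token"]] == config["eos_token_id"]
print(bos_ok, eos_ok)  # True False -> eos is inconsistent (id 7 vs 2)
```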
Hello!
Will Yi-Large be released as open source?
Model Yi-1.5-9B-Chat: UserWarning: Using the model-agnostic default max_length (=20) to control the generation length. We recommend setting max_new_tokens to control the maximum length of the generation.
The 4K context window is nowhere near enough. Could you release a 16K version?