
Yi-1.5's Issues

Problem downloading the model from ModelScope

Why does downloading the model files with the following script fail? I have used the same command to download other models without any problems.

import torch
from modelscope import snapshot_download, AutoModel, AutoTokenizer
from modelscope import GenerationConfig
model_dir = snapshot_download('01-ai/Yi-1.5-34B-Chat', cache_dir='/public/home/team4/zerooneai', revision='master')

Tokenizer questions

  1. We know the vocabulary of Yi-34B (including 1.5) has 64000 entries, so why does the tokenizer contain 3 extra tokens, for an actual total of 64003?
  2. Yi-1.5 uses the new ChatML format as its chat template, which includes the assistant role, but the vocabulary has no token for it (there is one for user), so "assistant" gets split into two tokens (ass + istant).

Other problems, such as use_fast producing different outputs and add_bos being enabled by default in the tokenizer config, have also been reported in other issues.
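The splitting described in point 2 can be illustrated with a toy greedy longest-prefix-match tokenizer over a hypothetical vocabulary (the real Yi tokenizer uses BPE merges, not this exact rule; this only sketches why a word that is absent from the vocabulary falls apart into subwords):

```python
# Toy greedy longest-prefix-match tokenization over a hypothetical vocabulary.
# "user" is in the vocabulary as a whole word, "assistant" is not, so it
# decomposes into the subwords "ass" + "istant" -- mirroring the issue above.
VOCAB = {"user", "ass", "istant", "a", "s", "i", "t", "n"}

def tokenize(text, vocab):
    tokens = []
    i = 0
    while i < len(text):
        # Take the longest vocabulary entry that matches at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[i]!r}")
    return tokens

print(tokenize("user", VOCAB))       # ['user']
print(tokenize("assistant", VOCAB))  # ['ass', 'istant']
```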

Chinese generation does not stop

Hello, while running a simple test following https://github.com/01-ai/Yi-1.5?tab=readme-ov-file#quick-start, I found that generation in Chinese never stops. For example, asking "你是谁?" ("Who are you?") yields:

"Hello! I am Yi, a large-scale language model independently developed by 01.AI (零一万物). I can answer questions, provide information, discuss topics, write articles, and more; whatever the field, I will do my best to help you. If you have any questions or need help, feel free to ask me at any time! What can I do for you? Coming back, I have a new answer here:

I am 01.AI's artificial intelligence assistant, designed to help users by answering questions and providing information and support. You can ask me about science, technology, history, culture, and all kinds of topics. If you have any questions, feel free to ask.

What do you think artificial intelligence in the future...."

Additional notes:

  1. Compared with the original code, the only changes were the messages and adding max_new_tokens = 128 to the generate call:
    messages = [ {"role": "user", "content": "你是谁?"} ]
  2. The md5 checksums have been verified.
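Until the underlying stop-token behavior is fixed, the runaway output can at least be trimmed after the fact. A minimal sketch; the id 7 for <|im_end|> comes from the tokenizer_config discussed elsewhere in these issues, and whether it is the right stop id for your setup is an assumption:

```python
# Sketch: cut a generated id sequence at the first occurrence of the stop
# token, keeping everything before it. EOS_ID = 7 assumes <|im_end|> is the
# intended stop token, as reported elsewhere in these issues.
EOS_ID = 7

def truncate_at_eos(ids, eos_id=EOS_ID):
    return ids[:ids.index(eos_id)] if eos_id in ids else ids

print(truncate_at_eos([100, 200, 7, 300, 400]))  # [100, 200]
print(truncate_at_eos([100, 200, 300]))          # [100, 200, 300]
```

With transformers you would normally pass eos_token_id=7 (or a list of stop ids) to model.generate instead, so that generation halts early rather than being trimmed afterwards.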

Official WeChat group: Yi User Group

Hello everyone, we are the 01.AI developer relations team.
To keep the group discussion high quality and to keep ad bots from degrading the experience for members, our WeChat group Yi User Group is invitation-only.
Discussion topics range from model training and downstream applications to deployment and the latest industry developments.
Please add me on WeChat first; once we have confirmed you are a Yi model developer, we will invite you into the group.

My WeChat:
[WeChat QR code image]

Richard Lin 林旅强
Open Source Lead, 01.AI


Quick start code

For the base models (those that are not chat models), could you provide a more appropriate demo than the current Quick Start code?

Output quality degrades when switching inference from transformers to vLLM

Model: 01ai/Yi-1.5-9B-Chat
Code: the officially provided code in both cases
Generation parameters: temperature=0.3, top_p=0.7 for both transformers and vLLM
Prompt: 鸡柳是鸡身上哪个部位? (Which part of the chicken is the chicken tenderloin?)

transformers output:
[screenshot]

vLLM output:
[screenshot]

I tried many different generation parameters with vLLM and generated many times, but it never once produced a correct answer.

A small suggestion on the development direction

Yi-1.5's self-introduction reads: "Compared with Yi, Yi-1.5 delivers stronger performance in coding, math, reasoning, and instruction-following capability, while still maintaining excellent capabilities in language understanding, commonsense reasoning, and reading comprehension."

In the vast majority of scenarios, coding and math capabilities are not needed, and models such as GPT already do well in those areas.
From my perspective as an application developer, what I would most like to see is a model with very strong instruction-following that is also economical to run.

Fast Tokenizer add unexpected space token

Hi Yi developers, the Yi-1.5-9B tokenizer generates an unexpected space token when tokenizing "<|im_end|>\n" with the fast tokenizer under older transformers versions, while it behaves normally with transformers 4.42.4 or without the fast tokenizer.

What is the correct way to tokenize "<|im_end|>\n"?
How is it tokenized in SFT stage?

Old version transformers w/ tokenizer_fast

  • transformers v4.36.5 / v4.41.2
  • use_fast=True
>>> tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-9B-Chat")
>>> tokenizer("<|im_end|>")
{'input_ids': [7], 'attention_mask': [1]}
>>> tokenizer("<|im_end|>\n")
{'input_ids': [7, 59568, 144], 'attention_mask': [1, 1, 1]}

In this case there is an unexpected token, 59568, which corresponds to a space.

New transformers w/ tokenizer_fast

  • transformers 4.42.4
  • use_fast=True
>>> tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-9B-Chat")
>>> tokenizer("<|im_end|>\n")
{'input_ids': [7, 144], 'attention_mask': [1, 1]}

Old transformers w/o tokenizer_fast

  • transformers 4.41.2
  • use_fast=False
>>> tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-9B-Chat", use_fast=False)
>>> tokenizer("<|im_end|>\n")
{'input_ids': [7, 144], 'attention_mask': [1, 1]}
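The three runs above can be compared mechanically. A small sketch, using only the id lists reported in this issue, that isolates the spurious token:

```python
# Compare the fast-tokenizer output under old transformers with the
# reference output, to isolate the spurious token. Ids are copied from
# the transcripts above.
old_fast = [7, 59568, 144]   # old transformers, use_fast=True
reference = [7, 144]         # new transformers, or use_fast=False

def spurious_tokens(got, expected):
    """Return ids present in `got` but not in `expected`, preserving order."""
    expected_set = set(expected)
    return [t for t in got if t not in expected_set]

print(spurious_tokens(old_fast, reference))  # [59568] -- the space token
```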

tokenizer bug

Hello, when using Yi-1.5 I run into a decoding problem: many extra spaces appear in the decoded output. For example, the input
[screenshot]
decodes to
[screenshot]
What could be the cause?

Yi-1.5-9B results cannot be reproduced

I used OpenCompass to evaluate Yi-1.5-9B on the MATH (4-shot), HumanEval / HumanEval-Plus (0-shot), and MBPP (3-shot) test sets. My results differ noticeably from the officially reported numbers. Could you share the official evaluation script or the detailed parameters so the results can be reproduced?

Below are my evaluation script and results.

  • Script:
cd opencompass
python run.py --datasets  math_gen humaneval_gen humaneval_plus_gen mbpp_gen  --hf-path /root/models/Yi-1.5-9B --model-kwargs device_map='auto' --tokenizer-kwargs padding_side='left' truncation='left' use_fast=False --max-out-len 512 --max-seq-len 4096 --batch-size 8 --no-batch-padding --num-gpus 1
  • Results:
dataset           version  metric                 mode  Yi-1.5-9B
math              5f997e   accuracy               gen   28.3
openai_humaneval  8e312c   humaneval_pass@1       gen   25.61
humaneval_plus    8e312c   humaneval_plus_pass@1  gen   21.34
mbpp              3ede66   score                  gen   58.6
mbpp              3ede66   pass                   gen   293
mbpp              3ede66   timeout                gen   4
mbpp              3ede66   failed                 gen   24
mbpp              3ede66   wrong_answer           gen   179
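As a sanity check, the MBPP score row is internally consistent with the raw outcome counts in the table: 293 passes out of 293 + 4 + 24 + 179 = 500 problems is 58.6%.

```python
# Recompute the MBPP pass rate from the raw outcome counts in the table above.
passed, timeout, failed, wrong_answer = 293, 4, 24, 179
total = passed + timeout + failed + wrong_answer
score = 100 * passed / total
print(total, score)  # 500 58.6
```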

A question about the tokenizer encoding <|im_start|>

I tested with the following code:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-9B-Chat")
print("token of <|im_start|>: " + str(tokenizer.encode("<|im_start|>")))
print("token of <|im_end|>: " + str(tokenizer.encode("<|im_end|>")))

The result is strange:

token of <|im_start|>: [1581, 59705, 622, 59593, 5858, 46826]
token of <|im_end|>: [7]

By rights, encoding <|im_start|> should output a single token, [6].

I am not sure whether this is a tokenizer problem, so I opened PRs on the official repo:
https://huggingface.co/01-ai/Yi-1.5-9B-Chat/discussions/12
https://huggingface.co/01-ai/Yi-1.5-9B-Chat/discussions/13

Please take a look at whether something is wrong here; thanks.

Does Yi-1.5-Chat model use the standard CHATML template?

@richardllin @panyx0718 @Imccccc Hi all, could you please give some advice on this issue?
Does the Yi-1.5-Chat model use the standard ChatML template? Is the bos_token <|im_start|> or <|startoftext|>? Is the eos_token <|im_end|> or <|endoftext|>?
Yi-1.5-34B-Chat-16K/config.json is not consistent with Yi-1.5-34B-Chat-16K/tokenizer_config.json.
During generation or training, will the bos_token be added at the front of the prompt?

As shown in Yi-1.5-34B-Chat-16K/config.json:

"bos_token_id": 1,
"eos_token_id": 2,

As shown in Yi-1.5-34B-Chat-16K/tokenizer_config.json:

"bos_token": "<|startoftext|>",
"eos_token": "<|im_end|>",

"1": {
"content": "<|startoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"2": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"7": {
"content": "<|im_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
}
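The inconsistency described above can be checked mechanically. A sketch using only the values copied from the two config files quoted in this issue (not a general-purpose validator):

```python
# Cross-check the token ids declared in config.json against the special-token
# table in tokenizer_config.json. All values are copied from the issue above.
config = {"bos_token_id": 1, "eos_token_id": 2}          # from config.json
added_tokens = {                                          # from tokenizer_config.json
    "1": "<|startoftext|>",
    "2": "<|endoftext|>",
    "7": "<|im_end|>",
}
tokenizer_config = {"bos_token": "<|startoftext|>", "eos_token": "<|im_end|>"}

def resolve(token_id):
    """Map a numeric token id to its string form via the added-tokens table."""
    return added_tokens.get(str(token_id))

# bos agrees: config.json's bos_token_id=1 resolves to <|startoftext|>.
print(resolve(config["bos_token_id"]) == tokenizer_config["bos_token"])  # True
# eos disagrees: config.json's eos_token_id=2 resolves to <|endoftext|>,
# while tokenizer_config.json declares eos_token=<|im_end|> (id 7).
print(resolve(config["eos_token_id"]) == tokenizer_config["eos_token"])  # False
```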
