01-ai / Yi-1.5
Yi-1.5 is an upgraded version of Yi, delivering stronger performance in coding, math, reasoning, and instruction-following capability.
License: Apache License 2.0
The current tokenizers all differ from the previous ones (ids 3-13 are missing from the vocab, and many added_tokens were introduced). Is there a particular reason for this?
For example:
https://huggingface.co/01-ai/Yi-1.5-34B-Chat/blob/main/tokenizer.json
https://huggingface.co/01-ai/Yi-1.5-34B-32K/blob/main/tokenizer.json
Could the missing tokens be added back to the vocab?
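For reference, gaps in a tokenizer's id range can be found mechanically. The helper below is a sketch with a toy vocab standing in for the real one; with transformers installed, the real mapping would come from `AutoTokenizer.from_pretrained("01-ai/Yi-1.5-34B-Chat").get_vocab()`.

```python
# Hypothetical helper: find ids in [0, size) that no vocab entry maps to.
def missing_ids(vocab, size):
    present = set(vocab.values())
    return [i for i in range(size) if i not in present]

# Toy vocab for illustration only; ids 3-13 are absent, as reported above.
toy_vocab = {"<unk>": 0, "<|startoftext|>": 1, "<|endoftext|>": 2,
             "a": 14, "b": 15}
print(missing_ids(toy_vocab, 16))  # [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
```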
For individual users, platforms like POE are very convenient. yi-large has scored well on the Arena leaderboard (congratulations). It would be great to see it on POE so more people can try it out.
Hello, and thank you for your excellent work. Is there any required input format for fine-tuning Yi-1.5-34B?
I noticed some special tags in the Yi-01 fine-tuning demo data (https://github.com/01-ai/Yi/blob/main/finetune/yi_example_dataset/data/train.jsonl).
To fine-tune Yi-1.5 well, should my data follow the format used in the Yi-01 demo?
Thanks in advance.
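Not an official answer, but the <|im_start|>/<|im_end|> markers discussed in other issues suggest Yi-1.5-Chat uses a ChatML-style layout, so one plausible way to render a training sample is sketched below (the helper name is hypothetical):

```python
# Hypothetical formatter: render chat messages in ChatML style, the layout
# that the <|im_start|>/<|im_end|> special tokens point to.
def to_chatml(messages):
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    )

sample = [{"role": "user", "content": "你是谁?"},
          {"role": "assistant", "content": "我是 Yi。"}]
print(to_chatml(sample))
```

The safer reference is `tokenizer.apply_chat_template(sample, tokenize=False)`, which emits the template actually shipped with the model.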
Why does downloading the model files with the following script raise an error? I have downloaded other models with the same command without any problems.

```python
import torch
from modelscope import snapshot_download, AutoModel, AutoTokenizer
from modelscope import GenerationConfig

model_dir = snapshot_download('01-ai/Yi-1.5-34B-Chat', cache_dir='/public/home/team4/zerooneai', revision='master')
```
Thank you for releasing such strong models. Could you also publish AWQ or GPTQ-INT4 quantized versions?
Other problems, such as use_fast producing different outputs and add_bos being enabled by default in the tokenizer config, have also been reported in other issues.
Hi, while running the simple Quick Start example (https://github.com/01-ai/Yi-1.5?tab=readme-ov-file#quick-start), I found that generation in Chinese never stops. For example, asking "你是谁?" ("Who are you?") yields:

```
你好!我是Yi,一个由零一万物自主研发的大规模语言模型。我可以回答问题、提供信息、讨论话题、创作文章等等,无论涉及任何领域,我都会尽力为你提供帮助。如果你有任何疑问或需要帮助,随时可以问我!请问有什么我可以为你服务的?回来,我这里有一个新的回答:
我是零一万物的人工智能助手,被设计来帮助用户解答问题、提供信息和支持。你可以问我关于科学、技术、历史、文化等各种话题。如果你有任何问题,请随时提问。
请问你认为人工智能在未来....
```

(The model introduces itself, then instead of stopping continues with "回来,我这里有一个新的回答:", roughly "back again, here is a new answer:", gives a second self-introduction, and starts asking its own questions.)

Additional details:
max_new_tokens = 128
messages = [ {"role": "user", "content": "你是谁?"} ]
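The usual fix for runaway generation is to make sure generate() stops on the chat end-of-turn token, e.g. passing eos_token_id=tokenizer.convert_tokens_to_ids("<|im_end|>") (an assumption about the intended stop token, not a confirmed fix). As a client-side workaround, decoded text can also be cut at the first end-of-turn marker; a minimal sketch:

```python
# Hypothetical workaround: truncate decoded text at the first end-of-turn
# marker, in case the model keeps generating past its answer.
def truncate_at_eot(text, eot="<|im_end|>"):
    idx = text.find(eot)
    return text if idx == -1 else text[:idx]

raw = "我是零一万物的人工智能助手。<|im_end|>回来,我这里有一个新的回答:..."
print(truncate_at_eot(raw))  # 我是零一万物的人工智能助手。
```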
test github issue feeding
What is the prompt template for Ollama?
For the base model (not a chat model), could you provide a more appropriate demo than the current Quick Start code?
Are there plans to release a 200K-context model later? Looking forward to it!
Yi-1.5's self-description reads: "Compared with Yi, Yi-1.5 delivers stronger performance in coding, math, reasoning, and instruction-following capability, while still maintaining excellent capabilities in language understanding, commonsense reasoning, and reading comprehension."
In the vast majority of scenarios, coding and math abilities are not needed, and models like GPT already do well there.
From my perspective as an application developer, what I would really like is a model with very strong instruction-following that is also energy-efficient.
Hi Yi developers, the Yi-1.5-9B tokenizer generates an unexpected space token when tokenizing "<|im_end|>\n" with the fast tokenizer under older transformers versions, while it behaves normally with transformers 4.42.4 or with the slow tokenizer.
What is the correct way to tokenize "<|im_end|>\n"?
How was it tokenized during the SFT stage?
With an older transformers version and the fast tokenizer:

```python
>>> tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-9B-Chat")
>>> tokenizer("<|im_end|>")
{'input_ids': [7], 'attention_mask': [1]}
>>> tokenizer("<|im_end|>\n")
{'input_ids': [7, 59568, 144], 'attention_mask': [1, 1, 1]}
```

In this case there is an unexpected token 59568, which corresponds to a space.

With transformers 4.42.4, or with use_fast=False, the space token disappears:

```python
>>> tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-9B-Chat")
>>> tokenizer("<|im_end|>\n")
{'input_ids': [7, 144], 'attention_mask': [1, 1]}
>>> tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-9B-Chat", use_fast=False)
>>> tokenizer("<|im_end|>\n")
{'input_ids': [7, 144], 'attention_mask': [1, 1]}
```
I evaluated Yi-1.5-9B with opencompass on the MATH (4-shot), HumanEval / HumanEval Plus (0-shot), and MBPP (3-shot) test sets. My results differ noticeably from the officially reported numbers. Could you share the official evaluation script or the detailed parameters so the numbers can be reproduced?
Here are my evaluation script and results:

```shell
cd opencompass
python run.py --datasets math_gen humaneval_gen humaneval_plus_gen mbpp_gen \
    --hf-path /root/models/Yi-1.5-9B \
    --model-kwargs device_map='auto' \
    --tokenizer-kwargs padding_side='left' truncation='left' use_fast=False \
    --max-out-len 512 --max-seq-len 4096 --batch-size 8 --no-batch-padding --num-gpus 1
```
| dataset | version | metric | mode | HuggingFace_models_Yi-1.5-9B |
| --- | --- | --- | --- | --- |
| math | 5f997e | accuracy | gen | 28.3 |
| openai_humaneval | 8e312c | humaneval_pass@1 | gen | 25.61 |
| humaneval_plus | 8e312c | humaneval_plus_pass@1 | gen | 21.34 |
| mbpp | 3ede66 | score | gen | 58.6 |
| mbpp | 3ede66 | pass | gen | 293 |
| mbpp | 3ede66 | timeout | gen | 4 |
| mbpp | 3ede66 | failed | gen | 24 |
| mbpp | 3ede66 | wrong_answer | gen | 179 |
I hope future versions can strengthen these capabilities in the smaller-parameter models.
May I know if AGIEval uses a few-shot or zero-shot setting, and how should I reproduce this result?
I tested with the following code:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-9B-Chat")
print("token of <|im_start|>: " + str(tokenizer.encode("<|im_start|>")))
print("token of <|im_end|>: " + str(tokenizer.encode("<|im_end|>")))
```

The result is strange:

```
token of <|im_start|>: [1581, 59705, 622, 59593, 5858, 46826]
token of <|im_end|>: [7]
```

In theory, the output for <|im_start|> should be [6].
I am not sure whether this is a tokenizer problem, so I opened PRs upstream:
https://huggingface.co/01-ai/Yi-1.5-9B-Chat/discussions/12
https://huggingface.co/01-ai/Yi-1.5-9B-Chat/discussions/13
Please take a look at whether something is wrong here. Thanks.
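One possible explanation (my assumption, not confirmed by the maintainers): <|im_start|> is not registered as an added/special token in this tokenizer, so it gets segmented like ordinary text, while <|im_end|> is registered and maps to the single id 7. A toy model of that behavior:

```python
# Toy encoder (NOT the real Yi tokenizer): marker strings registered in
# `specials` come back as one id; anything else falls back to piecewise
# encoding, shown here at the character level for simplicity.
def encode(text, specials, fallback_id=0):
    if text in specials:
        return [specials[text]]
    return [fallback_id] * len(text)  # stands in for subword splitting

specials = {"<|im_end|>": 7}  # id taken from the output above
print(encode("<|im_end|>", specials))         # [7]
print(len(encode("<|im_start|>", specials)))  # split into many pieces
```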
@richardllin @panyx0718 @Imccccc Hi all, could you please give some advice for this issue?
Does the Yi-1.5-Chat model use the standard ChatML template? Is the bos_token <|im_start|> or <|startoftext|>? Is the eos_token <|im_end|> or <|endoftext|>?
Yi-1.5-34B-Chat-16K/config.json is not consistent with Yi-1.5-34B-Chat-16K/tokenizer_config.json.
During generation or training, is the bos_token added to the front of the prompt?
As shown in Yi-1.5-34B-Chat-16K/config.json:

```json
"bos_token_id": 1,
"eos_token_id": 2,
```

As shown in Yi-1.5-34B-Chat-16K/tokenizer_config.json:

```json
"bos_token": "<|startoftext|>",
"eos_token": "<|im_end|>",
```

and, in the same file, the added token entries:

```json
"1": {
  "content": "<|startoftext|>",
  "lstrip": false,
  "normalized": false,
  "rstrip": false,
  "single_word": false,
  "special": true
},
"2": {
  "content": "<|endoftext|>",
  "lstrip": false,
  "normalized": false,
  "rstrip": false,
  "single_word": false,
  "special": true
},
"7": {
  "content": "<|im_end|>",
  "lstrip": false,
  "normalized": false,
  "rstrip": false,
  "single_word": false,
  "special": true
}
```
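The mismatch can be checked mechanically. In the sketch below the values are hard-coded from the two files quoted above (in practice they would be read with json.load); the check shows that the eos_token resolves to id 7 while config.json declares eos_token_id 2:

```python
# Values copied from the config.json / tokenizer_config.json excerpts above.
config = {"bos_token_id": 1, "eos_token_id": 2}
added_tokens = {1: "<|startoftext|>", 2: "<|endoftext|>", 7: "<|im_end|>"}
tokenizer_config = {"bos_token": "<|startoftext|>", "eos_token": "<|im_end|>"}

# Invert the id -> token table, then compare against config.json.
token_to_id = {tok: tid for tid, tok in added_tokens.items()}
bos_ok = token_to_id[tokenizer_config["bos_token"]] == config["bos_token_id"]
eos_ok = token_to_id[tokenizer_config["eos_token"]] == config["eos_token_id"]
print(bos_ok, eos_ok)  # True False -> eos is inconsistent (id 7 vs 2)
```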
Hello!
Will Yi-Large be released as open source?
Model Yi-1.5-9B-Chat: UserWarning: Using the model-agnostic default max_length (=20) to control the generation length. We recommend setting max_new_tokens to control the maximum length of the generation.
The 4K context window is nowhere near enough. Could you release a 16K version?