
skyworkai / skywork


Skywork series models are pre-trained on 3.2TB of high-quality multilingual (mainly Chinese and English) and code data. We have open-sourced the model, training data, evaluation data, evaluation methods, etc.

License: Other

Languages: Shell 13.36%, Python 86.64%
Topics: llm

skywork's People

Contributors

bbuf, eltociear, jianxindong, tianwenwei, zhao1iang


skywork's Issues

Selection and use of the evaluation sets

  1. Your introduction mentions that you built validation sets for several domains (Chinese, English, code, arXiv articles, and so on). Do you plan to open-source these validation sets? At the moment I only see the six categories (technical articles | movie reviews | government reports | games | finance | general domain); validation sets aligned with the common Chinese and English benchmarks would probably be more useful to everyone.
  2. Regarding the six validation sets you have released, their sources appear to be almost entirely news. Have you tried rewriting data from the common Chinese/English benchmarks into sequence form to validate model quality, or, for domain-specific tasks, rewriting domain data into sequences for validation?

BUG: the probabilities of options A/B/C/D are computed incorrectly in ceval, cmmlu and mmlu

In evaluate_ceval.py, evaluate_cmmlu.py and evaluate_mmlu.py under the Skywork/eval/ folder, the key code that obtains the probabilities of options A/B/C/D is:

    softval = torch.nn.functional.softmax(
        torch.tensor(
            [
                logits[tokenizer("A")["input_ids"][-1]],
                logits[tokenizer("B")["input_ids"][-1]],
                logits[tokenizer("C")["input_ids"][-1]],
                logits[tokenizer("D")["input_ids"][-1]],
            ]
        ),
        dim=0,
    )

Take option A as an example:
tokenizer("A") treats "A" as a sentence and prepends the SentencePiece prefix "▁", so what actually gets encoded is "<s> ▁A", giving input_ids=[1, 319]. The id obtained by tokenizer("A")["input_ids"][-1] in the code is therefore 319, which corresponds to the token "▁A", whereas the id of the bare "A" token is:

tokenizer.convert_tokens_to_ids('A') = 29909

Options B, C and D have the same problem.

An example full_prompt used during evaluation has the following format:

以下是关于农学的单项选择题,请直接给出正确答案的选项。

题目:肉牛屠宰后,胴体的哪个部位肉质较好
A. 胸
B. 腹
C. 大腿
D. 小腿
答案:C

……

题目:羊胴体中,肉质较好的部位是
A. 胸下肉
B. 肩胛肉
C. 后腿肉
D. 小腿肉
答案:C

以下是关于农学的单项选择题,请直接给出正确答案的选项。

题目:在农业生产中被当作极其重要的劳动对象发挥作用,最主要的不可替代的基本生产资料是
A. 农业生产工具
B. 土地
C. 劳动力
D. 资金
答案:

According to the full_prompt format above, the option should be filled in directly after "答案:" rather than on a new line.
Therefore, when picking the ids for options A/B/C/D, the probabilities of the bare "A", "B", "C", "D" tokens should be used, not those of the "▁A", "▁B", "▁C", "▁D" tokens.
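A minimal sketch of the suggested fix (not the repository's code; it assumes the tokenizer follows the LLaMA/SentencePiece convention described above, so the bare option ids are looked up with convert_tokens_to_ids):

    import torch
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(
        "Skywork/Skywork-13B-base", use_fast=False, trust_remote_code=True
    )

    # Stand-in for the model's next-token logits at the position right after "答案:";
    # in evaluate_ceval.py this vector comes from the model's forward pass.
    logits = torch.randn(tokenizer.vocab_size)

    # Look up the ids of the bare "A"/"B"/"C"/"D" tokens (e.g. 29909 for "A")
    # instead of tokenizer("X")["input_ids"][-1], which returns the "▁X" token id.
    choice_ids = [tokenizer.convert_tokens_to_ids(c) for c in ("A", "B", "C", "D")]

    softval = torch.nn.functional.softmax(logits[choice_ids], dim=0)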

BUG: wrong attention_mask in the evaluation loss computation

Hi, I noticed that when eval_loss.py computes the loss, it uses attention_mask[:, :-1] while also using right padding. In the batched case, i.e. when padding is present, every sequence shorter than the maximum length then gets an extra loss term at its end for generating a padding token, and that term is usually an order of magnitude larger than the loss on the real text, so the result is wrong. With right padding, changing it to attention_mask[:, 1:] fixes the problem.
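A minimal sketch of the behaviour the report suggests (illustrative names, not the actual variables in eval_loss.py; right padding is assumed):

    import torch

    def masked_lm_loss(logits, input_ids, attention_mask):
        # logits: (batch, seq_len, vocab); input_ids / attention_mask: (batch, seq_len).
        shift_logits = logits[:, :-1, :]   # position t predicts token t+1
        shift_labels = input_ids[:, 1:]
        # The mask must be shifted on the label side, attention_mask[:, 1:],
        # not attention_mask[:, :-1]; otherwise the last real position of every
        # shorter sequence is scored against a padding token.
        shift_mask = attention_mask[:, 1:].reshape(-1).to(shift_logits.dtype)
        loss = torch.nn.functional.cross_entropy(
            shift_logits.reshape(-1, shift_logits.size(-1)),
            shift_labels.reshape(-1),
            reduction="none",
        )
        return (loss * shift_mask).sum() / shift_mask.sum()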

Questions about eval_loss.py

In line 58, the number of tokens is calculated with attention_mask = attention_mask[:, :-1] and torch.sum(attention_mask).item(). But do we really need to shift the attention mask? Maybe torch.sum(attention_mask).item() - batch_size (without shifting) is correct?

For example if the batch size is 2, the input_ids can be [[1, 2, 3], [1, 2, pad]] and the attention mask is [[True, True, True], [True, True, False]]. Using attention_mask = attention_mask[:, :-1] and torch.sum(attention_mask).item() will output 4 as the number of tokens. But actually the token number should be 3 because we only calculate logits on [2, 3] and [2, pad] (first label is shifted by label = label[:, 1: ]) and pad isn't counted as a valid token for calculating loss. If we set IGNORE_INDEX in labels according to attention_mask, we don't need shifted attention mask when calculating loss.

A code example:

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Skywork/Skywork-13B-base", use_fast=False, trust_remote_code=True)

tokenizer.pad_token = tokenizer.eos_token

tokenized_texts = tokenizer(["This is an example text.", "这是一个实例文本,这句话比较长。"], add_special_tokens=False, padding=True, truncation=True, max_length=128, return_tensors="pt")

input_ids = tokenized_texts.input_ids
attention_mask = tokenized_texts.attention_mask

print(f"Input sequence length: {input_ids.size(1)}")
print("Input labels:")
print(input_ids)
print("Input attention mask:")
print(attention_mask)
print(f"num_tokens: {torch.sum(attention_mask).item() - input_ids.size(0)}")

shift_attention_mask = attention_mask[:, :-1]
print("Shifted attention mask:")
print(shift_attention_mask)
print(f"num_tokens: {torch.sum(shift_attention_mask).item()}")

output is:

Input sequence length: 13
Input labels:
tensor([[  910,   338,   385,  1342,  1426, 29889,     2,     2,     2,     2,
             2,     2,     2],
        [29871, 30810, 30392, 41176, 50921, 45522, 30214, 30810, 32760, 31852,
         40579, 31143, 30267]])
Input attention mask:
tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
num_tokens: 17
Shifted attention mask:
tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
num_tokens: 18

But from the Input labels output, it is obvious that the token num should be 17 (first sample 338 to 29889 and second 30810 to 30267).

No .bin file in the saved checkpoint

Hi, I have a small question. After I changed the config to zero3, the saved checkpoint no longer contains a .bin file. Did I get a step wrong somewhere, or is zero3 simply not usable here?

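Not an official answer, but for reference: with DeepSpeed ZeRO-3 the weights are normally saved as sharded zero checkpoints rather than a single pytorch_model.bin, and they can be consolidated afterwards. A sketch only (function name per upstream DeepSpeed; the checkpoint path is hypothetical):

    import torch
    from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

    # Gather the sharded ZeRO-3 weights into one fp32 state dict and save it.
    state_dict = get_fp32_state_dict_from_zero_checkpoint("output/checkpoint-1000")
    torch.save(state_dict, "output/checkpoint-1000/pytorch_model.bin")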

About packing the pre-training data

Hi, I have some questions about how the pre-training data is processed:
From the sample pre-training file pt_train.jsonl and the training code, it looks like each sample is padded individually to sequence_length, using the eos token as padding. In other projects I know of, all training data is tokenized, concatenated, and then sliced into chunks. My understanding of the trade-off is: concatenating and slicing ensures that almost every token seen during training is a real token, whereas processing samples individually spends a lot of compute on padding tokens (a sketch of the concatenate-then-chunk approach is given below).
During actual training, did you pack many individual samples together up to roughly sequence_length to form each training example?
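For reference, the concatenate-then-chunk approach mentioned above usually looks something like the sketch below (not the project's actual preprocessing; the eos handling and chunk length are assumptions):

    def pack_examples(texts, tokenizer, seq_len=4096):
        # Tokenize all documents, join them with eos, then cut into fixed-length
        # blocks so that (almost) every token in a batch is a real, non-padding token.
        ids = []
        for text in texts:
            ids.extend(tokenizer(text, add_special_tokens=False)["input_ids"])
            ids.append(tokenizer.eos_token_id)   # document boundary
        # Drop the tail shorter than seq_len (alternatively, pad only that last block).
        n_blocks = len(ids) // seq_len
        return [ids[i * seq_len : (i + 1) * seq_len] for i in range(n_blocks)]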

How should the MOCK_GSM8K_TEST evaluation set be used?

Hi, I am trying to reproduce this experiment on other models. In this experiment, is the dataset used differently from the original GSM8K? I do not see a prompt part related to the question.

The current eval_loss script does not support the chatglm family of models

As the title says, the current eval_loss script does not support chatglm models. Could you add that evaluation?
Or how should the script be modified to support them? I tried, and the chatglm tokenizer's attributes seem to differ, and its padding_side also differs from what the eval code assumes, so forcing the script to run yields a loss of inf.

The paper link is the technical report of the Baichuan model

The url arXiv: 2309.10305 is the technical report of the Baichuan model:

@Article{skyworkmath,
title={SkyMath: Technical Report},
author={Liu Yang, Haihua Yang, Wenjun Cheng, Lei Lin, Chenxia Li, Yifu Chen, Lunan Liu, Jianfei Pan, Tianwen Wei, Biye Li, Liang Zhao, Lijie Wang, Bo Zhu, Guoliang Li, Xuejie Wu, Xilin Luo, Rui Hu},
journal={arXiv preprint arXiv: 2310.16713},
url={https://arxiv.org/abs/2309.10305},
year={2023}
}

Would you provide more information about SkyMath?

Your paper suggests that Instruction Boosting and Self-Compare FT are very helpful, but Instruction Boosting looks like Wizard-Evol and Self-Compare FT looks very similar to PHP, and from the tech report I cannot tell what the differences between them are.

报错Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit

Is this because there is not enough GPU memory?

    if "cpu" in device_map_without_lm_head.values() or "disk" in device_map_without_lm_head.values():
    -> 3246 raise ValueError(
       3247     """
       3248     Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit
    ...
    these modules in 32-bit, you need to set load_in_8bit_fp32_cpu_offload=True and pass a custom
    device_map to from_pretrained. Check
    https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu
    for more details.
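For what it's worth, the message means the model weights do not all fit on the GPU and accelerate wants to offload some modules. One common workaround is to load in half precision with an explicit offload folder; a sketch only (not the repo's scripts, and the exact flags depend on your transformers/accelerate versions):

    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "Skywork/Skywork-13B-base",
        trust_remote_code=True,
        torch_dtype=torch.bfloat16,   # halves the memory footprint vs fp32
        device_map="auto",            # let accelerate split layers across devices
        offload_folder="offload",     # spill layers that do not fit onto disk
    )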

Reproduced validation-set evaluation results do not match the reported table

Hi, I have a few questions and would appreciate some clarification.
Running the given command with Qwen-14B: bash bash_scripts/skywork_eval_loss.sh:

  • The average result is 2.424, which is inconsistent with the 9.67 in the report. Why is that? Is the released validation set incomplete?
  • Why is the text truncated during evaluation? Is the truncation applied to the input sentence length rather than to tokens? Different models have different max_length values; do all of them use 4096 max tokens? Using 4096 everywhere does not suit every compared model. If each model's own max length is used instead, the input lengths differ per model, so isn't the evaluation unfair?
  • Without truncation, the current evaluation code supports neither a sliding window nor every compared model's max length (a sliding-window sketch is given after this list).

I hope these questions can be answered, thanks!
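On the sliding-window point above: a common way to evaluate loss on documents longer than a model's max_length is a strided sliding window, roughly as sketched below (not the repo's eval code; max_length and stride are illustrative):

    import torch

    @torch.no_grad()
    def sliding_window_loss(model, input_ids, max_length=4096, stride=2048):
        # Average LM loss over one long sequence; each window only scores the part
        # that the previous window has not scored yet.
        losses, prev_end = [], 0
        seq_len = input_ids.size(1)
        for begin in range(0, seq_len, stride):
            end = min(begin + max_length, seq_len)
            target_len = end - prev_end              # tokens new to this window
            window = input_ids[:, begin:end]
            labels = window.clone()
            labels[:, :-target_len] = -100           # do not score the overlap twice
            out = model(input_ids=window, labels=labels)
            losses.append(out.loss)
            prev_end = end
            if end == seq_len:
                break
        return torch.stack(losses).mean()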

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`

I tried to run it with the published code (prediction, single A100). Could you give me some advice?

The following error message appeared:
....
../aten/src/ATen/native/cuda/Indexing.cu:1239: indexSelectSmallIndex: block: [25,0,0], thread: [6,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1239: indexSelectSmallIndex: block: [25,0,0], thread: [7,0,0] Assertion srcIndex < srcSelectDimSize failed.
(the same assertion is repeated for threads [8,0,0] through [31,0,0]; the Python traceback interleaved with these messages is untangled below)

        return self._call_impl(*args, **kwargs)
      File "/mnt/SSD_12TB/ethan/application/anaconda3/envs/sky/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
        return forward_call(*args, **kwargs)
      File "/home/ethan/.cache/huggingface/modules/transformers_modules/Skywork-13B-base/modeling_skywork.py", line 726, in forward
        outputs = self.model(
      File "/mnt/SSD_12TB/ethan/application/anaconda3/envs/sky/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
        return self._call_impl(*args, **kwargs)
      File "/mnt/SSD_12TB/ethan/application/anaconda3/envs/sky/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
        return forward_call(*args, **kwargs)
      File "/home/ethan/.cache/huggingface/modules/transformers_modules/Skywork-13B-base/modeling_skywork.py", line 641, in forward
        layer_outputs = decoder_layer(
      File "/mnt/SSD_12TB/ethan/application/anaconda3/envs/sky/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
        return self._call_impl(*args, **kwargs)
      File "/mnt/SSD_12TB/ethan/application/anaconda3/envs/sky/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
        return forward_call(*args, **kwargs)
      File "/home/ethan/.cache/huggingface/modules/transformers_modules/Skywork-13B-base/modeling_skywork.py", line 449, in forward
        hidden_states, self_attn_weights, present_key_value = self.self_attn(
      File "/mnt/SSD_12TB/ethan/application/anaconda3/envs/sky/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
        return self._call_impl(*args, **kwargs)
      File "/mnt/SSD_12TB/ethan/application/anaconda3/envs/sky/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
        return forward_call(*args, **kwargs)
      File "/home/ethan/.cache/huggingface/modules/transformers_modules/Skywork-13B-base/modeling_skywork.py", line 346, in forward
        query_states = self.q_proj(hidden_states)
      File "/mnt/SSD_12TB/ethan/application/anaconda3/envs/sky/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
        return self._call_impl(*args, **kwargs)
      File "/mnt/SSD_12TB/ethan/application/anaconda3/envs/sky/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
        return forward_call(*args, **kwargs)
      File "/mnt/SSD_12TB/ethan/application/anaconda3/envs/sky/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
        return F.linear(input, self.weight, self.bias)
    RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)

Will you officially provide a magnet-link data source?

Yesterday Mistral AI posted a tweet containing nothing but a magnet link, which is quite inspiring.
People who opened the magnet link found an 87 GB torrent; judging from the naming and directory structure, it is a PyTorch model file.
The mixtral-8x7b-32kseqlen that Mistral AI "open-sourced" this time is a Mixture of Experts (MoE) model composed of 8 expert networks of 7 billion parameters each (8x7b), reportedly the world's first open-source MoE large model.

I feel we should bypass Hugging Face.

That way users in ** regions could download it quickly.

Legacy behaviour of the SkyworkTokenizer: "tokens that come after special tokens will not be properly handled"

When loading the tokenizer with transformers.AutoTokenizer we receive a warning: You are using the legacy behaviour of the <class 'transformers_modules.Skywork.Skywork-13B-base.98a59dec44df3a8fd8fcd4bac07e94db35219eb1.tokenization_skywork.SkyworkTokenizer'> This means that tokens that come after special tokens will not be properly handled.

We have already updated transformers from 4.31.0 to 4.34.0, but we see the same warning in both versions.
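For reference, in upstream transformers this warning comes from the LLaMA-style slow tokenizer and can be silenced by opting into the new behaviour. A sketch only; whether SkyworkTokenizer accepts the same keyword as LlamaTokenizer is an assumption:

    from transformers import AutoTokenizer

    # `legacy=False` opts into the fixed handling of tokens that follow special
    # tokens in LLaMA-style tokenizers; passing it through here assumes the
    # custom SkyworkTokenizer forwards the keyword like LlamaTokenizer does.
    tokenizer = AutoTokenizer.from_pretrained(
        "Skywork/Skywork-13B-base",
        use_fast=False,
        trust_remote_code=True,
        legacy=False,
    )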

Code open-sourced?

The code in the train folder seems to be only applicable for fine-tuning the model. Will the pre-training code for the model be open-sourced?

Why releasing 13B-model instead of smaller ones, say, 7B?

In Chapter 3 of your tech report you compare LLaMA-7B with (your) GPT-7B, but you ultimately released a 13B model. So there are two questions:

  1. Will you release a smaller model?
  2. Why did you design the model the way the report describes? Does it perform better than the LLaMA architecture?

ValueError: Trainer: evaluation requires an eval_dataset.

At the last pre-training step, the Trainer wants to evaluate validation metrics and raises an error because no eval data was specified. How do I turn this off? ValueError: Trainer: evaluation requires an eval_dataset.

        metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
      File "/home/suser/.conda/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 3062, in evaluate
        eval_dataloader = self.get_eval_dataloader(eval_dataset)
      File "/home/suser/.conda/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 888, in get_eval_dataloader
        raise ValueError("Trainer: evaluation requires an eval_dataset.")
    ValueError: Trainer: evaluation requires an eval_dataset.
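For reference, the Trainer only calls evaluate() when evaluation is enabled. A sketch of turning it off via TrainingArguments (argument names per upstream transformers; the training script may expose these as CLI flags instead, in which case change them there):

    from transformers import TrainingArguments

    # Disable in-training evaluation so no eval_dataset is required.
    args = TrainingArguments(
        output_dir="output",
        evaluation_strategy="no",   # never run evaluation during training
        do_eval=False,
    )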
