baichuan-inc / baichuan-7b

A large-scale 7B pretraining language model developed by BaiChuan-Inc.

Home Page: https://huggingface.co/baichuan-inc/baichuan-7B

License: Apache License 2.0

Python 99.74% Shell 0.26%
artificial-intelligence ceval large-language-models natural-language-processing mmlu chatgpt gpt-4 huggingface llama chinese

baichuan-7b's People

Contributors

baichuan-assistant, bc-gpd, benywon, gradientguru, s-jol, xuehaipan, zmsn-2077


baichuan-7b's Issues

[Question] Are the evaluation results based on the freshly pretrained checkpoint, or on a model that has gone through SFT + RLHF?

Required prerequisites

Questions

Sorry, I could not find enough description of this.
Are the evaluation results based on the checkpoint straight after pre-training, or has SFT + RLHF already been applied?

Checklist

  • I have provided all relevant and necessary information above.
  • I have chosen a suitable title for this issue.

Number of epochs over the pretraining data

During pretraining, does all of the data go through exactly one epoch, or does the Chinese portion of the data go through more than one epoch?

A way to avoid exceeding 24 GB of VRAM

Testing with the official code:

(python3.8) [baichuan@localhost baichuan-7B]$ python3 generate.py
The model weights are not tied. Please use the tie_weights method before using the infer_auto_device function.
登鹳雀楼->王之涣
夜雨寄北->李商隐
过零丁洋->文天祥
己亥杂诗(其五)->龚自珍

[Question] When will the instruction-tuned chat version of the model be released?

Required prerequisites

Questions

The currently released baichuan-7B appears to be the base model, which cannot be used for dialogue directly.

Checklist

  • I have provided all relevant and necessary information above.
  • I have chosen a suitable title for this issue.

[Question] Training config used for pretraining

Required prerequisites

Questions

"Based on the optimization techniques above, we achieved a throughput of 182 TFLOPS for the 7B model on a cluster of one thousand A800 GPUs, with peak GPU utilization reaching 58.3%."

First of all, thank you for your commitment to open source.

Could you also release the pretraining config that was used when this throughput was achieved?

Checklist

  • I have provided all relevant and necessary information above.
  • I have chosen a suitable title for this issue.

Why use DeepSpeed to accelerate training?

It is really impressive that you released such a complete model along with comparison data. Much respect!
I have a question about training acceleration: Meta has the fairseq framework (which provides mixed-precision training and a hierarchical object-registration mechanism for flexibly adjusting hyperparameters). What considerations led you to choose DeepSpeed instead?

[Question] The official demo fails with RuntimeError: Internal: src/sentencepiece_processor.cc(1101)

Required prerequisites

Questions

$ python demo.py
Traceback (most recent call last):
File "/data/good/baichuan-7B/demo.py", line 3, in
tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/baichuan-7B", trust_remote_code=True)
File "/home/good/anaconda3/envs/baichuan/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 693, in from_pretrained
return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
File "/home/good/anaconda3/envs/baichuan/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1812, in from_pretrained
return cls._from_pretrained(
File "/home/good/anaconda3/envs/baichuan/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1975, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "/data/good/.cache/huggingface/modules/transformers_modules/baichuan-7B/tokenization_baichuan.py", line 89, in init
self.sp_model.Load(vocab_file)
File "/home/good/anaconda3/envs/baichuan/lib/python3.10/site-packages/sentencepiece/init.py", line 905, in Load
return self.LoadFromFile(model_file)
File "/home/good/anaconda3/envs/baichuan/lib/python3.10/site-packages/sentencepiece/init.py", line 310, in LoadFromFile
return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: src/sentencepiece_processor.cc(1101) [model_proto->ParseFromArray(serialized.data(), serialized.size())]

$ cat demo.py
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/baichuan-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("baichuan-inc/baichuan-7B", device_map="auto", trust_remote_code=True)
inputs = tokenizer('登鹳雀楼->王之涣\n夜雨寄北->', return_tensors='pt')
inputs = inputs.to('cuda:0')
pred = model.generate(**inputs, max_new_tokens=64,repetition_penalty=1.1)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
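
This particular RuntimeError usually means the tokenizer.model file on disk is not a valid SentencePiece model, for example an incomplete download or a Git LFS pointer file. A minimal check-and-refetch sketch, assuming huggingface_hub is installed (my suggestion, not part of the original report):

import os
from huggingface_hub import snapshot_download

# Re-download (or reuse the cached copy of) the tokenizer file and check its size.
local_dir = snapshot_download("baichuan-inc/baichuan-7B", allow_patterns=["tokenizer.model"])
tok_path = os.path.join(local_dir, "tokenizer.model")
print(tok_path, os.path.getsize(tok_path), "bytes")
# A size of only a few hundred bytes suggests a Git LFS pointer file rather than the real model;
# deleting the cached copy and downloading again normally fixes the ParseFromArray error.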

Checklist

  • I have provided all relevant and necessary information above.
  • I have chosen a suitable title for this issue.

[BUG] During model training: CUDA error: device-side assert triggered

Required prerequisites

System information

import sys, transformers
print(sys.version, sys.platform)
3.9.2 (default, Feb 28 2021, 17:03:44)
[GCC 10.2.1 20210110] linux
print(transformers.__version__)
4.30.2

torch==2.0.0
cuda=11.7

Problem description

I have not set CUDA_LAUNCH_BLOCKING=1 to debug this in depth, but at a rough glance it looks like an out-of-bounds indexing error.

Exception has occurred: RuntimeError (note: full exception trace is shown but execution is paused at: _run_module_as_main)
CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
...
...
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [749,0,0], thread: [54,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [749,0,0], thread: [55,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [857,0,0], thread: [114,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
(the same assertion repeats for many more blocks and threads)
...
...

The error is raised when executing this line of code: https://github.com/baichuan-inc/baichuan-7B/blob/4a7a461854b261ab7ec1fd890a5fb0cce0518d16/models/modeling_baichuan.py#L47

Additional context

I looked through the other issues; issue #23 mentions that setting the tokenizer's pad_id to 0 makes the error go away, see #23 (comment).

I tried this and training does indeed run, but the token with id 0 appears to be unk. Does setting it this way create a gap with the pretraining objective?
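
For reference, a minimal sketch of the workaround referenced above (pad_id = 0, i.e. the unk token); whether this creates a mismatch with pretraining is exactly the open question here:

# Workaround sketch: use token id 0 (unk) as the padding id so padded batches
# no longer index past the embedding table.
tokenizer.pad_token_id = 0
model.config.pad_token_id = 0
# As long as padded positions are excluded from the loss (labels set to -100),
# the choice of pad id mainly affects the embedding lookup, not the training objective.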

Checklist

  • I have provided all relevant and necessary information above.
  • I have chosen a suitable title for this issue.

[Question] Inference performance of the example code is very poor on an RTX 4090

Using the example code:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/baichuan-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("baichuan-inc/baichuan-7B", device_map="auto", trust_remote_code=True)
inputs = tokenizer('登鹳雀楼->王之涣\n夜雨寄北->\n', return_tensors='pt')
inputs = inputs.to('cuda:0')
pred = model.generate(**inputs, max_new_tokens=512, do_sample=True)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))

After adding a print statement right before model.generate(), the full output takes about one minute to produce.
VRAM usage is 23 GB and average GPU utilization is about 80%.
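
For what it is worth, a rough way to quantify "slow" is to time the generate call and report tokens per second (my own sketch, not from the issue):

import time

# Time the generation and report decoding throughput.
start = time.time()
pred = model.generate(**inputs, max_new_tokens=512, do_sample=True)
elapsed = time.time() - start
new_tokens = pred.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens} new tokens in {elapsed:.1f}s = {new_tokens / elapsed:.1f} tok/s")

About one minute for 512 tokens is roughly 8-9 tok/s. Since the default float32 weights (around 28 GB) do not fit entirely in 24 GB, device_map="auto" likely offloads part of the model to CPU memory, which slows decoding; loading the weights in float16 via torch_dtype usually avoids that.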

[Question] Why is there no bos_token after tokenization?

Required prerequisites

Questions

Thank you for your work! I have a small question: why is there no bos_token after running your tokenizer? I only noticed this because my code kept emitting warnings.
Take the same sentence, What is your name?

With LLaMA's tokenizer the result is:
1, 1724, 338, 596, 1024, 29973
where 1 is the bos_token.
With your tokenizer the result is:
2276, 715, 879, 1868, 81
with no bos_token.

Is this intentional? I checked your special_tokens_map and it does contain a bos token. For later use, do I need to prepend the bos_token manually, or can I just ignore it and feed the tokenized output directly into the model for inference?
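
If a bos token turns out to be needed, it can be prepended by hand; a small sketch assuming the tokenizer exposes a bos_token_id (my addition, not an official recommendation):

import torch

# Prepend the bos token id to the encoded input manually.
input_ids = tokenizer("What is your name?", return_tensors="pt").input_ids
if tokenizer.bos_token_id is not None:
    bos = torch.tensor([[tokenizer.bos_token_id]], dtype=input_ids.dtype)
    input_ids = torch.cat([bos, input_ids], dim=1)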

Checklist

  • I have provided all relevant and necessary information above.
  • I have chosen a suitable title for this issue.

[Question] Implemented low-resource quantized training and deployment for the Baichuan model

Required prerequisites

Questions

Low-resource quantized training and deployment for the Baichuan model has been implemented here: https://github.com/jianzhnie/Efficient-Tuning-LLMs
LoRA fine-tuning runs on a single RTX 3090 GPU; the QLoRA method is also supported, requiring as little as 8 GB of VRAM.
Run the following command to perform 4-bit quantized training:

python qlora_finetune.py \
    --model_name_or_path baichuan-inc/baichuan-7B \
    --dataset_name alpaca \
    --output_dir ./work_dir/baichuan-7B \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy steps \
    --eval_steps 2000 \
    --eval_dataset_size 1024 \
    --max_eval_samples 1000 \
    --save_strategy steps \
    --save_total_limit 5 \
    --save_steps 500 \
    --logging_strategy steps \
    --logging_steps 10 \
    --learning_rate 0.0002 \
    --warmup_ratio 0.03 \
    --weight_decay 0.0 \
    --lr_scheduler_type constant \
    --adam_beta2 0.999 \
    --max_grad_norm 0.3 \
    --max_new_tokens 32 \
    --source_max_len 512 \
    --target_max_len 512 \
    --lora_r 64 \
    --lora_alpha 16 \
    --lora_dropout 0.1 \
    --double_quant \
    --quant_type nf4 \
    --fp16 \
    --bits 4 \
    --gradient_checkpointing \
    --do_train \
    --do_eval \
    --data_seed 42 \
    --seed 0

Screenshots of the deployment are attached to the original issue.

[Question] Is everyone else's output quality this poor as well?

Required prerequisites

Questions

用户:你是谁
模型:我是一个平凡的人,一个普普通通的人,一个平平凡凡的人,一个平平常常的人,一个平平常常的人,一个平平常常的人,一个平平常常的人,一个平平常常的人,一个平平常常的人,一个平平常常的人,一个平平

用户:你能做什么
模型:你能做什么? 2010年10月29日 星期二 晴 昨天晚上,我正在看电视,突然,妈妈叫我:"快点,快点,你快去洗脸,洗脚,洗脚,洗脚,洗脚,洗脚,

用户:你是谁
模型:《你是谁》是连载于起点中文网的网络小说,作者是小鱼儿。 都市小说小说类型都市小说内容简介一个平凡的女孩,在一次偶然的机会下,认识了两个帅哥,从此,她的人生发生了翻天复地的变化..

Checklist

  • I have provided all relevant and necessary information above.
  • I have chosen a suitable title for this issue.

Questions about the model's vocabulary

Hello, and thank you for open-sourcing this model! I have three questions about how the vocabulary was built and evaluated:

  1. What data was the vocabulary trained on? Could you share some training details, such as the proportions of Chinese, English, and other languages?
  2. On what corpus was the compression-rate metric measured? Could the evaluation script for this part be released? (A sketch of one way to measure it is shown after this list.)
  3. In the third point of the tokenization section you mention "full coverage of rare characters and words"; does this refer to SentencePiece's byte_fallback option?
    Looking forward to your reply!
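
Regarding question 2, here is a rough sketch of how a compression ratio could be measured on one's own corpus; this is only my guess at the metric, not the team's official script, and sample_corpus.txt is a hypothetical file:

from transformers import AutoTokenizer

# Tokens-per-character ratio of the baichuan tokenizer on a sample text (lower = better compression).
tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/baichuan-7B", trust_remote_code=True)
text = open("sample_corpus.txt", encoding="utf-8").read()  # hypothetical evaluation corpus
n_tokens = len(tokenizer(text)["input_ids"])
print(f"{n_tokens / len(text):.3f} tokens per character")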

[Question] Is there a performance comparison against Vicuna and RWKV?

Required prerequisites

Questions

In the LLM leaderboard run by LMSYS Org, Vicuna and RWKV scored just behind ChatGPT and Claude. Is there a performance comparison between baichuan and these two models?

Checklist

  • I have provided all relevant and necessary information above.
  • I have chosen a suitable title for this issue.

Implemented LoRA fine-tuning for the baichuan-7B model

Supports SFT and RLHF pipelines with Alpaca and other instruction datasets: https://github.com/hiyouga/LLaMA-Efficient-Tuning

LoRA fine-tuning runs on a single RTX 3090 GPU; the QLoRA method is also supported (requiring as little as 12 GB of VRAM).

LoRA weights of the fine-tuned model: https://huggingface.co/hiyouga/baichuan-7b-sft

Run the following command to perform instruction-tuning on the Alpaca dataset:

CUDA_VISIBLE_DEVICES=0 python src/train_sft.py \
    --model_name_or_path <path to the baichuan-7B model folder or its Hugging Face repo id> \
    --do_train \
    --dataset alpaca_gpt4_zh \
    --finetuning_type lora \
    --lora_rank 8 \
    --lora_target W_pack \
    --output_dir alpaca_baichuan \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 100 \
    --eval_steps 100 \
    --learning_rate 5e-5 \
    --max_grad_norm 0.5 \
    --num_train_epochs 3.0 \
    --dev_ratio 0.01 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --plot_loss \
    --fp16

A screenshot of the training run and a screenshot of the dialogue after LoRA instruction-tuning are attached to the original issue.

[Question] Chinese-task results from lm-evaluation-harness, compared with WizardLM

Required prerequisites

Questions

Thanks to the Baichuan team for their contribution. To gauge baichuan-7B's Chinese ability, I selected the Chinese tasks from lm-evaluation-harness: xwinograd_zh, xnli_zh, xcopa_zh, xstory_cloze_zh, and mgsm_zh. The first four lean toward reasoning, while mgsm_zh leans toward math. I ran two evaluations, one with num_fewshot set to 0 and one with num_fewshot set to 5. Note that because lm-evaluation-harness does not pass trust_remote_code to the tokenizer by default, I had to apply a small hack to get it running; everything else was left unchanged (an approximate invocation is sketched below).
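
The invocation was roughly of the following form, reconstructed from the result headers below; the exact flags may differ between lm-evaluation-harness versions:

python main.py \
    --model hf-causal-experimental \
    --model_args pretrained=/models/baichuan-inc_baichuan-7B/,trust_remote_code=True \
    --tasks mgsm_zh,xcopa_zh,xstory_cloze_zh,xwinograd_zh,xnli_zh \
    --num_fewshot 0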

The results are as follows:
hf-causal-experimental (pretrained=/models/baichuan-inc_baichuan-7B/,trust_remote_code=True), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

Task Version Metric Value Stderr
mgsm_zh 0 acc 0.0360 ± 0.0118
xcopa_zh 0 acc 0.6700 ± 0.0210
xstory_cloze_zh 0 acc 0.6320 ± 0.0124
xwinograd_zh 0 acc 0.7857 ± 0.0183
xnli_zh 0 acc 0.3818 ± 0.0069

hf-causal-experimental (pretrained=/models/baichuan-inc_baichuan-7B/,trust_remote_code=True), limit: None, provide_description: False, num_fewshot: 5, batch_size: None

Task Version Metric Value Stderr
mgsm_zh 0 acc 0.0960 ± 0.0187
xcopa_zh 0 acc 0.7240 ± 0.0200
xstory_cloze_zh 0 acc 0.6565 ± 0.0122
xwinograd_zh 0 acc 0.8016 ± 0.0178
xnli_zh 0 acc 0.4341 ± 0.0070

For comparison, the Chinese ability of WizardLM-7B:
hf-causal-experimental (pretrained=/models/TheBloke_WizardLM-7B-uncensored-GPTQ/,quantized=WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors,gptq_use_triton=True), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

Task Version Metric Value Stderr
mgsm_zh 0 acc 0.0280 ± 0.0105
xcopa_zh 0 acc 0.5340 ± 0.0223
xstory_cloze_zh 0 acc 0.5162 ± 0.0129
xwinograd_zh 0 acc 0.5417 ± 0.0222
xnli_zh 0 acc 0.3439 ± 0.0067

hf-causal-experimental (pretrained=/models/TheBloke_WizardLM-7B-uncensored-GPTQ/,quantized=WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors,gptq_use_triton=True), limit: None, provide_description: False, num_fewshot: 5, batch_size: None

Task Version Metric Value Stderr
mgsm_zh 0 acc 0.0360 ± 0.0118
xcopa_zh 0 acc 0.5420 ± 0.0223
xstory_cloze_zh 0 acc 0.5242 ± 0.0129
xwinograd_zh 0 acc 0.6071 ± 0.0218
xnli_zh 0 acc 0.3599 ± 0.0068

The comparison shows that the Chinese ability is indeed much improved over LLaMA-series derivatives. I hope the Baichuan team keeps up the good work!

Checklist

  • I have provided all relevant and necessary information above.
  • I have chosen a suitable title for this issue.

[Question] The answer never changes when the same question is asked repeatedly

Required prerequisites

Questions

问题: **目前通用的火警电话是
回答: **目前通用的火警电话是119,为什么不是0? 因为**没有零号

When I ask the same question repeatedly the answer never changes, and the same holds for other questions.
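
This is expected with the default settings: generate decodes greedily unless sampling is enabled, so the same prompt always yields the same output. A minimal sketch using standard transformers arguments (my addition):

# Enable sampling so repeated calls can produce different continuations.
pred = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.9, temperature=0.8)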

Checklist

  • I have provided all relevant and necessary information above.
  • I have chosen a suitable title for this issue.

[Question] Model output quality

Required prerequisites

Questions

curl -X POST "http://10.20.20.221:8000" -H 'Content-Type: application/json' -d '{"prompt": "**有什么好吃的"}'
{"response":"**有什么好吃的?\n**有什么好吃的?\n**的美食很多,有**的特色小吃,有**的特色菜,有**的特色甜点,有**的特色饮料,有**的特色水果,有**的特色海鲜,有**的特色小吃,有**的特色小吃","status":200,"time":"2023-06-15 19:06:16"}

I wrapped the model behind a simple API; the responses look somewhat off-topic and do not really answer the question.

Checklist

  • I have provided all relevant and necessary information above.
  • I have chosen a suitable title for this issue.

Question about the base model

Is the released model a version that has gone through SFT (supervised instruction fine-tuning), or the version straight after pretraining with no further training?

Is the output of the official example what you would expect?

The first line is the input.

Baichuan-User: 登鹳雀楼->王之涣\n夜雨寄北->\n
登鹳雀楼->王之涣\n夜雨寄北->\n客至->\n望月怀远->\n凉州词->\n
一、根据诗句的意思,给下列古诗找出对应的作者
1、一览众山小\n( \n)
2、飞流直下三千尺\n( \n)
3、春风又绿江南岸\n( \n)
4、但看黄河入海流\n( \n)
5、日暮乡关何处是\n( \n)
6、但愿人长久\n( \n)
7、举头望明月,低头思故乡\n( \n)
8、故人西辞黄鹤楼\n( \n)
9、无边落木萧萧下\n( \n)
10、一览众山小
作者( \n)
11、飞流直下三千尺
作者( \n)
12、春风又绿江南岸
作者( \n)
13、但看黄河入海流
作者( \n)
14、日暮乡关何处是
作者( \n)
15、但愿人长久
作者( \n)
16、举头望明月,低头思故乡
作者( \n)
17、故人西辞黄鹤楼
作者( \n)
18、无边落木萧萧下
作者( \n)
19、一览众山小
作者( \n)
20、春风又绿江南岸
作者( \n)
21、但看黄河入海流
作者( \n)
22、日暮乡关何处是
作者( \n)
23、但愿人长久
作者( \n)
24、举头望明月,低头思故乡
作者

[Question] What is the composition of the pretraining data?

Required prerequisites

Questions

Thank you to the project team for open-sourcing this model. The pretraining data is the most important factor determining an LLM's capabilities and the source of its chain-of-thought reasoning.
I suggest the team publish the composition of the pretraining data, the sampling weights, and related settings, as the LLaMA: Open and Efficient Foundation Language Models paper does. This is the most important basis for reproducing, judging, and analyzing the model's performance.
Thank you!

Checklist

  • I have provided all relevant and necessary information above.
  • I have chosen a suitable title for this issue.

[Question] Setting the benchmark datasets aside, I just asked one thing: "你好" (hello)

Required prerequisites

Questions

A screenshot of the model's reply (taken 2023-06-19, 9:38 AM) is attached to the original issue.
This is the answer? That is hard to justify, isn't it?

Checklist

  • I have provided all relevant and necessary information above.
  • I have chosen a suitable title for this issue.

[Question] Could you provide a web_demo.py with a chat-box style web page?

Required prerequisites

Questions

As the title says, something like what ChatGLM provides.

Checklist

  • I have provided all relevant and necessary information above.
  • I have chosen a suitable title for this issue.

[Question] Pretraining time and pretraining data

Required prerequisites

Questions

  1. How long was the model trained on the thousand-GPU cluster?
  2. The README mentions pretraining on roughly 1.2T tokens; what is the language distribution of that data?
    Thanks for your reply!

Checklist

  • I have provided all relevant and necessary information above.
  • I have chosen a suitable title for this issue.

[Question] Where do you set parameters such as temperature and top_p?

Required prerequisites

Questions

As in the title: are they set inside generate? The official example is:
model.generate(**inputs, max_new_tokens=64)
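
Yes, they are passed directly as keyword arguments to generate, and they only take effect when do_sample=True; a short sketch with standard transformers arguments:

# Sampling parameters go straight into generate().
pred = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,        # temperature/top_p/top_k only matter when sampling is enabled
    temperature=0.7,
    top_p=0.85,
    top_k=50,
    repetition_penalty=1.1,
)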

Checklist

  • I have provided all relevant and necessary information above.
  • I have chosen a suitable title for this issue.

The official HF inference code produces no output

Why does inference produce no output at all for me?

prompt = 'Hamlet->Shakespeare\nOne Hundred Years of Solitude->'
inputs = tokenizer(prompt, return_tensors='pt')
inputs = inputs.to('cuda:0')
pred = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
print(inputs)
print(pred)

Hamlet->Shakespeare One Hundred Years of Solitude-> {'input_ids': tensor([[ 4891, 1438, 3817, 2798, 16020, 5, 3665, 745, 4475, 10063, 679, 4088, 5248, 3817]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')} tensor([[ 4891, 1438, 3817, 2798, 16020, 5, 3665, 745, 4475, 10063, 679, 4088, 5248, 3817, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], device='cuda:0')

The decoded output looks exactly identical to the input.
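
The continuation consists entirely of token id 0, which skip_special_tokens=True then hides, so the decoded text ends up identical to the prompt. A small debugging sketch (my addition) to see what is actually generated:

# Decode without dropping special tokens and check what token id 0 maps to.
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=False))
print(tokenizer.convert_ids_to_tokens([0]))  # presumably <unk> or another special token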

It feels like the base model's conversational ability is weak, so it is mainly strong on these specific downstream tasks; to get real conversational ability, SFT is still needed, right?

Input:
inputs = tokenizer('用中文介绍一下百川大模型', return_tensors='pt')
pred = model.generate(**inputs, max_new_tokens=48, do_sample=True)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))

Output:
用中文介绍一下百川大模型
中文的
百川,是百川融媒信息科技有限公司的名称,我们致力于“为广大企业提供从企业成立到发展壮大、从小到大的全寿命管理解决方案,帮助企业实现价值最大化。”
百川融媒信息科技有限公司由**知名新媒体教育培训品牌“麦派学堂”联合《大河报》发起成立。
百川融媒信息科技有限公司通过对新媒体运营、新技术、新产品研发等方面的创新和推广,建立服务于全国中小企业的行业生态,使百川融媒成为帮助客户提升企业价值的行业典范。

Input:
inputs = tokenizer('你是谁', return_tensors='pt')
pred = model.generate(**inputs, max_new_tokens=48, do_sample=True)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))

Output:
你是谁?
你曾经对一个人好,可那个人并不知道。这世上没有不透风的墙,他一定会后悔的。
你好:
这句话的字面上的意思是(那个人)曾经对你好,

Sharing a convenient, ready-to-run script (cli_demo.py) with friendlier multi-GPU support; feel free to copy it and try it out.

import os
import platform
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Features:
# 1. Works in both CPU and GPU mode automatically
# 2. Loads in half precision on GPU, halving VRAM usage
# 3. Automatically distributes layers across cards in multi-GPU mode
# 4. Chat history / context is not supported yet
# 5. Streaming (typewriter) output is not supported yet (very long answers may appear to hang;
#    adjust MAX_TOKENS as a temporary workaround)
# Author: [email protected], hobbyist developer / Yang; feel free to email me with questions

def auto_configure_device_map(num_gpus: int):
    # Spread the 32 transformer layers evenly across the available GPUs.
    num_trans_layers = 32
    per_gpu_layers = num_trans_layers / num_gpus
    device_map = {'model.embed_tokens': 0,
                  'model.norm': num_gpus - 1, 'lm_head': num_gpus - 1}
    for i in range(num_trans_layers):
        device_map[f'model.layers.{i}'] = int(i // per_gpu_layers)
    return device_map

# MODEL_NAME = "../baichuan-7B-model"
MODEL_NAME = "baichuan-inc/baichuan-7B"

NUM_GPUS = torch.cuda.device_count() if torch.cuda.is_available() else 0
MAX_TOKENS = 512
device_map = auto_configure_device_map(NUM_GPUS) if NUM_GPUS > 0 else None
device = torch.device("cuda") if NUM_GPUS > 0 else torch.device("cpu")
device_dtype = torch.half if NUM_GPUS > 0 else torch.float

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, trust_remote_code=True, device_map=device_map, torch_dtype=device_dtype)
model = model.eval()

os_name = platform.system()
clear_command = 'cls' if os_name == 'Windows' else 'clear'
hello_string = "欢迎使用 BaiChuan-7B 模型,输入内容即可进行对话,clear 清空对话历史,stop/exit/quit 终止程序"

def build_prompt(history):
    prompt = hello_string
    for query, response in history:
        prompt += f"\n\n用户: {query}"
        prompt += f"\n回复: {response}"
    return prompt

history = []
print(hello_string)
while True:

    query = input("\n用户: ")

    if query.strip() in ["stop", "stop()", "exit", "exit()", "quit", "quit()", "q", "q()"]:
        break
    if query.strip() in ["clear", "clear()", "cls", "cls()"]:
        history = []
        os.system(clear_command)
        print(hello_string)
        continue

    inputs = tokenizer(query, return_tensors='pt')
    inputs.input_ids = inputs.input_ids.to(device)
    inputs.attention_mask = inputs.attention_mask.to(device)
    pred = model.generate(inputs=inputs.input_ids, attention_mask=inputs.attention_mask, max_new_tokens=MAX_TOKENS, repetition_penalty=1.1)
    response = tokenizer.decode(pred.cpu().tolist()[0])
    # Strip the echoed prompt from the decoded text, keeping only the new reply.
    response = response[len(query) + response.find(query):]
    if response[-4:] == "</s>":
        response = response[:-4]

    history += [(query, response)]
    print(f"\n回复: {response}")

os.system(clear_command)
print(build_prompt(history), flush=True)

[Question] Is there a low-precision inference mode? Where is it configured?

Required prerequisites

Questions

Running the official code as-is requires 28 GB of VRAM, which seems unreasonably high for a 7B model. Trying the following approach raises an error:
model = AutoModelForCausalLM.from_pretrained("baichuan-inc/baichuan-7B", device_map="auto", trust_remote_code=True).half()
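
One approach that may work better than calling .half() after a device_map="auto" load is to request the dtype at load time; with bitsandbytes installed, an 8-bit load is a further option. A minimal sketch using standard transformers arguments (not an official recipe):

import torch
from transformers import AutoModelForCausalLM

# Load the weights in half precision directly (roughly 14 GB instead of 28 GB).
model = AutoModelForCausalLM.from_pretrained(
    "baichuan-inc/baichuan-7B",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

# Alternative, assuming bitsandbytes is installed: 8-bit quantized weights (roughly 8 GB).
# model = AutoModelForCausalLM.from_pretrained(
#     "baichuan-inc/baichuan-7B", load_in_8bit=True, device_map="auto", trust_remote_code=True)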

Checklist

  • I have provided all relevant and necessary information above.
  • I have chosen a suitable title for this issue.

Possible data leakage in the evaluation?

When a C-Eval question is used as the input, the baichuan-7B model automatically continues with the A/B/C/D options and the answer. Could this be data leakage?

Input:
在Unix中,passwd命令位于____目录中的。

Output:
在Unix中,passwd命令位于____目录中的。 A. /etc/ B. /usr/ C. /bin/ D. /usr/bin/ 答案:A</s>

Comparative evaluation of LLaMA-Vicuna-13B and Baichuan-Vicuna-7B (scored by GPT-4, for reference)

Required prerequisites

Questions

First, thanks to both the Baichuan team and the authors of baichuan-vicuna-7b for their work. Since people may be interested in how the model performs after SFT (#37), I am sharing evaluation results for Baichuan Vicuna 7B scored by GPT-4 on FastChat's evaluation set:

https://baichuan-vicuna-eval.pleisto.app/

Since baichuan-vicuna-7b was trained mainly on the ShareGPT dataset, in which Chinese data has a relatively small share, the most meaningful comparison seemed to be to evaluate it directly on FastChat's English evaluation set and compare it side by side with LLaMA Vicuna 13B, which was also trained on ShareGPT data.

The evaluation summary generated by GPT-4 is as follows:

Based on the scoring data provided, we can analyze the two LLMs, baichuan-vicuna-7b and llama-vicuna-13b, in detail. We first compute the average score for each task and then give an overall assessment of the two models.

Writing:

baichuan-vicuna-7b:(9 + 9 + 9 + 9 + 9 + 8 + 7 + 9 + 7 + 8) / 10 = 8.5
llama-vicuna-13b:(8.5 + 9 + 9 + 10 + 9 + 9 + 9 + 9 + 8 + 9) / 10 = 9.05

Role-playing:

baichuan-vicuna-7b:(8 + 5 + 4 + 9 + 9 + 8 + 8 + 8 + 8 + 8) / 10 = 7.5
llama-vicuna-13b:(9 + 10 + 9 + 9 + 9 + 9 + 9 + 9 + 9 + 9) / 10 = 9.1

Common-sense knowledge:

baichuan-vicuna-7b:(9 + 8 + 9 + 9 + 9 + 9 + 9 + 9 + 9 + 9) / 10 = 8.9
llama-vicuna-13b:(8.5 + 9 + 9 + 8.5 + 9 + 8.5 + 8.5 + 10 + 8.5 + 9) / 10 = 8.85

Fermi problems:

baichuan-vicuna-7b:(5 + 4 + 5 + 4 + 7 + 2 + 6 + 5 + 4 + 6) / 10 = 4.8
llama-vicuna-13b:(8 + 8 + 7 + 9 + 9 + 8 + 8 + 8 + 7 + 8) / 10 = 8

Counterfactual questions:

baichuan-vicuna-7b:(4 + 9 + 8 + 8 + 8 + 9 + 9 + 8 + 6 + 8) / 10 = 7.7
llama-vicuna-13b:(8 + 9 + 9 + 9 + 9.5 + 8.5 + 9 + 9 + 9 + 9) / 10 = 8.9

Coding:

baichuan-vicuna-7b:(4 + 3 + 7.5 + 3 + 4 + 4 + 5) / 7 = 4.36
llama-vicuna-13b:(2 + 2 + 6.5 + 5 + 5 + 2 + 6) / 7 = 4

Math:

baichuan-vicuna-7b:(2 + 4 + 4) / 3 = 3.33
llama-vicuna-13b:(4 + 5 + 2) / 3 = 3.67

Generic open-ended QA:

baichuan-vicuna-7b:(9 + 8 + 6 + 9 + 9 + 9 + 9 + 9 + 9 + 9) / 10 = 8.7
llama-vicuna-13b:(8.5 + 9 + 8 + 9 + 9 + 8.5 + 9 + 8.5 + 7 + 10) / 10 = 8.65

Domain knowledge:

baichuan-vicuna-7b:(9 + 9 + 7 + 8 + 8 + 8 + 9 + 9 + 9 + 9) / 10 = 8.5
llama-vicuna-13b:(8.5 + 9 + 9 + 9 + 9 + 9 + 9 + 8 + 8.5 + 9.5) / 10 = 8.85

From the averages computed above, llama-vicuna-13b outperforms baichuan-vicuna-7b on 7 of the 10 tasks (writing, role-playing, Fermi problems, counterfactual questions, coding, math, and domain knowledge), while baichuan-vicuna-7b outperforms llama-vicuna-13b on 3 tasks (common-sense knowledge, generic open-ended QA, and coding).

Overall, llama-vicuna-13b performs better, since its average score is higher on more of the tasks. That said, the data also shows the two models are quite close on some tasks, such as common-sense knowledge, generic open-ended QA, and domain knowledge. llama-vicuna-13b is clearly better on Fermi problems, counterfactual questions, and role-playing, while baichuan-vicuna-7b is slightly better on coding.

Given the difference in parameter count between baichuan-vicuna-7b (7B parameters) and llama-vicuna-13b (13B parameters), their performance should be weighed accordingly. In general, a model with more parameters may perform better, but it also consumes more compute, so a trade-off is needed in practice.

Across the 10 tasks above, llama-vicuna-13b ("Model B" below) wins on 7 tasks and baichuan-vicuna-7b wins on 3. Although llama-vicuna-13b does better on most tasks, on some of them, such as common-sense knowledge, generic open-ended QA, and coding, the gap between the two is small, which means baichuan-vicuna-7b may offer better value for the compute spent on those tasks.

For different application scenarios, the following recommendations may help in choosing a model:

  1. If compute is plentiful and good performance is needed across all tasks, choose the larger Model B.
  2. If compute is limited, or the goal is cost-effectiveness on specific tasks (such as common-sense knowledge, generic open-ended QA, and coding), consider the smaller baichuan-vicuna-7b.
  3. For Fermi problems, counterfactual questions, and role-playing, llama-vicuna-13b has a clear advantage and should be preferred.

In summary, after accounting for the difference in parameter count, Model B (llama-vicuna-13b) performs better than baichuan-vicuna-7b but also consumes more compute. In practice, the choice between baichuan-vicuna-7b and llama-vicuna-13b should be weighed against task requirements and available compute.

Checklist

  • I have provided all relevant and necessary information above.
  • I have chosen a suitable title for this issue.

The <pad> token in the tokenizer

The model's output vocabulary dimension is 64000, while the tokenizer's __len__ is 64001.

PADDING_TOKEN = list(tokenizer.added_tokens_encoder.keys())[0]  # the literal pad token does not render on GitHub

tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/baichuan-7B", trust_remote_code=True)
len(tokenizer)
64001
tokenizer.vocab_size
64000
tokenizer(PADDING_TOKEN)
{'input_ids': [64000], 'attention_mask': [1]}

Feeding PADDING_TOKEN to the model raises an error. Steps to reproduce:
tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/baichuan-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("baichuan-inc/baichuan-7B", device_map="auto", trust_remote_code=True)
inputs = tokenizer(PADDING_TOKEN, return_tensors='pt')
inputs = inputs.to('cuda:0')
pred = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
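
The added <pad> token gets id 64000, but the model's embedding and output layers only cover ids 0 through 63999, hence the error. A common workaround (a sketch, not an official fix) is to resize the embeddings after loading; note that the new row is randomly initialized, since the model was never trained with this token:

# Grow the embedding matrix and lm_head to cover the extra <pad> token (id 64000).
model.resize_token_embeddings(len(tokenizer))  # 64001 rows afterwards; the new row is untrained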
