
Tiny LLM zh

1. Introduction

This project aims to build a small-parameter Chinese large language model for quickly getting started with LLM fundamentals. If you find it useful, please give it a star. Thanks!

Model architecture: the overall architecture follows common open-source designs, including RMSNorm, RoPE, and MHA (multi-head attention).

Implementation: the project covers the two LLM training stages and subsequent human alignment, i.e. tokenization (Tokenizer) -> pre-training (PTM) -> supervised fine-tuning (SFT) -> human alignment (RLHF, DPO) -> evaluation -> quantization -> deployment.

The project has been deployed and can be tried online (see the web demo in Section 4.1).

Project features:

  • All data and code are public, including the pre-training data and the tokenizer (see Tiny LLM Datasets);
  • Walks through the complete LLM pipeline: tokenization (Tokenizer) -> pre-training (PTM) -> supervised fine-tuning (SFT) -> human alignment (RLHF, DPO) -> evaluation -> deployment;
  • Releases 35B pre-training tokens, 4M SFT examples, and 170K RL examples;
  • Tokenizer training: a 20K Chinese vocabulary is trained on 10 GB of Chinese encyclopedia text and merged with the Llama2 vocabulary to build the Tiny LLM vocabulary;
  • Training uses Transformers and DeepSpeed, with support for multi-node multi-GPU setups and optimizations such as ZeRO;
  • All code is launched via Bash scripts and supports models of different sizes, e.g. 16m, 42m, 92m, 210m, 440m;
  • Supports an MoE architecture; the tiny_llm_moe branch includes recent techniques such as shared experts and expert load balancing (a rough sketch follows this list);
  • Supports the vLLM inference framework;
  • Supports the llama.cpp inference framework.
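
Since the MoE branch is only described at a high level here, the following is a minimal, hedged sketch of the shared-expert idea (one always-active expert plus a few top-k routed experts). It is a generic PyTorch module, not the actual tiny_llm_moe implementation; the class names and hyperparameters are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeedForward(nn.Module):
    """Llama-style gated MLP, used here as a single expert."""
    def __init__(self, hidden_size, intermediate_size):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

class SharedExpertMoE(nn.Module):
    """Top-k routed experts plus one shared expert that sees every token."""
    def __init__(self, hidden_size, intermediate_size, num_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList(
            [FeedForward(hidden_size, intermediate_size) for _ in range(num_experts)]
        )
        self.shared_expert = FeedForward(hidden_size, intermediate_size)

    def forward(self, x):
        batch, seq, hidden = x.shape
        flat = x.view(-1, hidden)
        weights = F.softmax(self.router(flat), dim=-1)        # (tokens, num_experts)
        topk_w, topk_idx = weights.topk(self.top_k, dim=-1)
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)    # renormalize the top-k weights

        out = self.shared_expert(flat)                        # shared expert is always active
        for e, expert in enumerate(self.experts):
            # routing weight of expert e per token (0 if e is not in that token's top-k)
            w = (topk_w * (topk_idx == e)).sum(dim=-1, keepdim=True)
            out = out + w * expert(flat)                      # dense for clarity; real MoE layers dispatch only the selected tokens
        return out.view(batch, seq, hidden)

# usage: drop-in replacement for the MLP block of a transformer layer
moe = SharedExpertMoE(hidden_size=288, intermediate_size=768)
y = moe(torch.randn(2, 16, 288))

Expert load balancing is usually handled by an auxiliary loss on the router probabilities that pushes tokens toward a uniform expert distribution; that part is omitted above.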

The project has three main branches; the main branch is recommended for learning. The differences are as follows:

  • llama2_torch : the model uses the original Llama2 architecture, with only some of the inputs/outputs adapted to a training-friendly format;
  • main tiny_llm : aligned with open-source community models; the underlying model is built with the Transformers library, which is also used for multi-node multi-GPU training;
  • tiny_llm_moe : based on tiny_llm, with the MLP layers replaced by an MoE model; multi-node multi-GPU training again uses the Transformers library.

Notes:

  1. Due to resource constraints, the top priority of this project is to walk through the whole LLM pipeline rather than to tune for strong results, so evaluation scores are low and some generations contain errors.
  2. Detailed data processing and training procedures are documented in the doc folder (still being organized...).

2. Quick Start

The models are hosted on Hugging Face and ModelScope and are downloaded automatically by the example code.

Loading the model online from Hugging Face is recommended; if that does not work, try ModelScope. To run from a local copy instead, change the path in model_id to a local directory.

Dependencies

  • Python 3.8 and above
  • PyTorch 2.0 and above
  • Transformers 4.37.2 and above
  • CUDA 11.4 and above is recommended (for training)
pip install -r requirements.txt 

🤗 HuggingFace

from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.generation import GenerationConfig

model_id = "wdndev/tiny_llm_sft_92m"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", trust_remote_code=True)
generation_config = GenerationConfig.from_pretrained(model_id, trust_remote_code=True)
sys_text = "你是由wdndev开发的个人助手。"
# user_text = "世界上最大的动物是什么?"
# user_text = "介绍一下刘德华。"
user_text = "介绍一下**。"
input_txt = "\n".join(["<|system|>", sys_text.strip(), 
                        "<|user|>", user_text.strip(), 
                        "<|assistant|>"]).strip() + "\n"

generation_config.max_new_tokens = 200
model_inputs = tokenizer(input_txt, return_tensors="pt").to(model.device)
generated_ids = model.generate(model_inputs.input_ids, generation_config=generation_config)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

🤖 ModelScope

from modelscope import AutoModelForCausalLM, AutoTokenizer

model_id = "wdndev/tiny_llm_sft_92m"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", trust_remote_code=True)

sys_text = "你是由wdndev开发的个人助手。"
# user_text = "世界上最大的动物是什么?"
# user_text = "介绍一下刘德华。"
user_text = "介绍一下**。"
input_txt = "\n".join(["<|system|>", sys_text.strip(), 
                        "<|user|>", user_text.strip(), 
                        "<|assistant|>"]).strip() + "\n"

model_inputs = tokenizer(input_txt, return_tensors="pt").to(model.device)
generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=200)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

Sample Generations

Q: What is the largest animal in the world?
A: The largest known animal is the blue whale (Balaenoptera musculus), a huge mammal and the largest species of the baleen whale suborder and family. Blue whales can grow to over 30 meters in length and weigh up to 175 tons. They live in the ocean and feed mainly on plankton such as crustaceans and small fish. Because of their enormous size and complex ecological communities, blue whales have become one of the popular attractions of marine tourism.

Q: Tell me about Andy Lau (刘德华).
A: Andy Lau is a Hong Kong pop singer, actor, and director who has made an enormous contribution to music. He is one of the greatest artists in the history of Chinese-language pop, with representative works including "爱我身体" and "肥皂泡". He also frequently appears in films and TV dramas and has been well received on television.

Q: Tell me about **.
A: ** is a continent located in East Asia, surrounded by Europe as well as Asia and other continents. It is **'s second-largest civilization and one of the largest economies in the world. Its history can be traced back to around 5000 BC, and it has had its own unique culture and language bearers from ancient times to the present.

3. Model

3.1 Tokenizer

There are two ways to build an LLM tokenizer: construct your own vocabulary and train a tokenizer, or use a tokenizer already trained for an open-source model.

For convenience, this project takes its vocabulary from an existing open-source project. Since the models trained here are small and the vocabulary size affects model size, projects with smaller vocabularies are preferred; after comparison, the ChatGLM3 vocabulary was chosen, with a size of 64,798 tokens.

For building your own vocabulary, see the tokenizer directory: the 32K LLaMA2 vocabulary is expanded to 50K by adding 20K Chinese tokens; details are in tokenizer/README.md (a rough sketch of the merging step is shown below).

Note: this project uses the ChatGLM3 vocabulary.
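
The exact procedure for building your own vocabulary is documented in tokenizer/README.md; the snippet below is only a rough sketch of the common SentencePiece merging approach (train a small Chinese BPE model, then append its pieces to the LLaMA2 tokenizer proto). The 20K vocabulary size comes from the description above; all file paths are placeholders.

import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

# 1) Train a ~20K-piece Chinese BPE model on the encyclopedia corpus (path is a placeholder).
spm.SentencePieceTrainer.train(
    input="zh_wiki.txt", model_prefix="zh_bpe",
    vocab_size=20000, model_type="bpe", character_coverage=0.9995,
)

# 2) Load both tokenizers and parse their model protos.
llama_sp = spm.SentencePieceProcessor(model_file="llama2_tokenizer.model")
zh_sp = spm.SentencePieceProcessor(model_file="zh_bpe.model")
llama_proto = sp_pb2.ModelProto()
llama_proto.ParseFromString(llama_sp.serialized_model_proto())
zh_proto = sp_pb2.ModelProto()
zh_proto.ParseFromString(zh_sp.serialized_model_proto())

# 3) Append the Chinese pieces that LLaMA2 does not already contain.
existing = {p.piece for p in llama_proto.pieces}
for p in zh_proto.pieces:
    if p.piece not in existing:
        new_piece = sp_pb2.ModelProto().SentencePiece()
        new_piece.piece = p.piece
        new_piece.score = 0.0
        llama_proto.pieces.append(new_piece)

# 4) Save the merged (~50K-token) model.
with open("tiny_llm_tokenizer.model", "wb") as f:
    f.write(llama_proto.SerializeToString())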

3.2 Model Architecture

The model uses a Llama2-like architecture, including RMSNorm, RoPE, and MHA.
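
As a quick illustration of one of these components, a minimal RMSNorm (the normalization used in Llama-style models in place of LayerNorm) can be written as below. This is a generic sketch, not the project's exact module.

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square norm: no mean-centering and no bias, cheaper than LayerNorm."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        variance = x.pow(2).mean(-1, keepdim=True)
        return self.weight * x * torch.rsqrt(variance + self.eps)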

3.3 Model Sizes

The detailed configurations are listed below (a rough parameter-count sanity check follows the table):

| model | hidden size | intermediate size | n_layers | n_heads | max context length | params | vocab size |
|---|---|---|---|---|---|---|---|
| tiny-llm-16m | 120 | 384 | 6 | 6 | 512 | 16M | 64798 |
| tiny-llm-42m | 288 | 768 | 6 | 6 | 512 | 42M | 64798 |
| tiny-llm-92m | 512 | 1024 | 8 | 8 | 1024 | 92M | 64798 |
| tiny-llm-210m | 768 | 2048 | 16 | 12 | 1024 | 210M | 64798 |
| tiny-llm-440m | 1024 | 2816 | 24 | 16 | 1024 | 440M | 64798 |
| tiny-llm-1_5b | 2048 | 5504 | 24 | 16 | 1024 | 1.5B | 64798 |
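
As a sanity check on the table, the parameter count of a Llama-style model can be roughly estimated from these columns. The helper below assumes untied input/output embeddings (as in the released config) and full multi-head attention without grouped KV heads, and it ignores small terms such as norm weights, so it is only an estimate.

def estimate_params(vocab_size, hidden_size, intermediate_size, n_layers, tied_embeddings=False):
    # Token embeddings, plus the LM head if embeddings are untied.
    embed = vocab_size * hidden_size * (1 if tied_embeddings else 2)
    # Attention: q, k, v and output projections.
    attn = 4 * hidden_size * hidden_size
    # Llama-style gated MLP: gate, up and down projections.
    mlp = 3 * hidden_size * intermediate_size
    return embed + n_layers * (attn + mlp)

# e.g. tiny-llm-92m from the table:
print(f"{estimate_params(64798, 512, 1024, 8) / 1e6:.1f}M")  # ~87M, in the same ballpark as the 92M row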

3.4 Model Evaluation

Since most of the pre-training and fine-tuning data is Chinese, the models are evaluated on the C-Eval and CMMLU datasets using the OpenCompass toolkit. The scores are as follows:

| model | Type | C-Eval | CMMLU |
|---|---|---|---|
| tiny-llm-92m | Base | 23.48 | 25.02 |
| tiny-llm-92m | Chat | 26.79 | 26.59 |

The Base model is evaluated in ppl mode and the Chat model in gen mode. The difference is illustrated in the figure below (a short code illustration also appears at the end of this section):

[Figure: comparison of the ppl and gen evaluation modes]

Source: "What is the difference between the ppl and gen modes?"

Note: only the two commonly used models were evaluated. The scores are low, and evaluating the remaining models would not be very meaningful.
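
To make the ppl/gen distinction concrete: in ppl mode every multiple-choice option is scored by the model's likelihood and the best-scoring option is picked, while gen mode lets the model generate free text and the answer is parsed from it. The functions below are a simplified, generic illustration with transformers (ppl here scores the whole question+option string), not OpenCompass's actual implementation.

import torch

def ppl_choice(model, tokenizer, question, options):
    """ppl mode: pick the option whose text the model finds most likely (lowest loss)."""
    losses = []
    for option in options:
        enc = tokenizer(question + option, return_tensors="pt").to(model.device)
        with torch.no_grad():
            loss = model(**enc, labels=enc.input_ids).loss
        losses.append(loss.item())
    return options[losses.index(min(losses))]

def gen_choice(model, tokenizer, question, max_new_tokens=32):
    """gen mode: generate freely, then the evaluator parses the answer out of the text."""
    enc = tokenizer(question, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(enc.input_ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0][enc.input_ids.shape[1]:], skip_special_tokens=True)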

4. Demo

4.1 Web Demo

The web demo has been deployed and can be tried online at: ModelScope Tiny LLM.

To run the web demo locally, first change the model path model_id in web_demo.py, then run:

streamlit run web_demo.py

[Screenshot: web demo]
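
The actual demo lives in web_demo.py in the repository. If you only want a feel for what a minimal Streamlit front-end around the model looks like, a hedged sketch (not the project's file; model_id and the prompt format follow Section 2) might be:

import streamlit as st
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "wdndev/tiny_llm_sft_92m"  # or a local directory

@st.cache_resource
def load_model():
    tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    mdl = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", trust_remote_code=True)
    return tok, mdl

tokenizer, model = load_model()
st.title("Tiny LLM zh demo")

if question := st.chat_input("Ask something"):
    st.chat_message("user").write(question)
    prompt = "\n".join(["<|system|>", "你是由wdndev开发的个人助手。",
                        "<|user|>", question.strip(), "<|assistant|>"]) + "\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(inputs.input_ids, max_new_tokens=200)
    answer = tokenizer.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
    st.chat_message("assistant").write(answer)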


tiny-llm-zh's Issues

A beginner's question about labels in SFTDataset

class SFTDataset(Dataset):
    ...
    def preprocessing(self, example, debug=False):
        input_ids, labels = [], []
        prompt_txt = self.system
        # print(type(example))
        user_txt = example["question"]
        assistant_txt = example["answer"]
        instruction = self.tokenizer.encode(text="\n".join(["<|system|>", prompt_txt.strip(), 
                                    "<|user|>", user_txt.strip(), 
                                    "<|assistant|>"]).strip() + "\n",
                                    add_special_tokens=True, 
                                    truncation=True, 
                                    max_length=self.max_length)
        response = self.tokenizer.encode(assistant_txt.strip(), add_special_tokens=False, truncation=True, max_length=self.max_length)
        input_ids = instruction + response + [self.tokenizer.eos_token_id]
        labels = [self.tokenizer.pad_token_id] * len(instruction) + response + [self.tokenizer.eos_token_id]
        if (len(input_ids) > self.max_length):
            return None
        if debug:
            print(self.tokenizer.decode(input_ids))
            print("-------------------------------")
        pad_len = self.max_length - len(input_ids)
        input_ids += [self.tokenizer.pad_token_id] * pad_len
        labels += [self.tokenizer.pad_token_id] * pad_len
        labels = [(l if l != self.tokenizer.pad_token_id else -100) for l in labels]
        input_ids = torch.LongTensor(input_ids)
        labels = torch.LongTensor(labels)
        attention_mask = input_ids.ne(self.tokenizer.pad_token_id)
        return {
            "input_ids": input_ids,
            "labels": labels,
            "attention_mask": attention_mask,
        }
    ...
    def __getitem__(self, idx) -> Dict[str, torch.Tensor]:
        processed_example = self.preprocessing(self.data[idx])
        while processed_example is None:
            idx = (idx + 1) % len(self.data)  # wrap around to the next valid sample
            processed_example = self.preprocessing(self.data[idx])
        return processed_example

Why is it that the labels in SFTDataset are not shifted left by one position,
the way PTMDatasetMap does below?

def __getitem__(self, index: int):
        real_index = self.shuffled_indices[index]
        fi, i = self.index_map[real_index]
        sample = self.data[fi][i]
        X = np.array(sample[:-1]).astype(np.int64)
        Y = np.array(sample[1:]).astype(np.int64)
        input_ids = torch.LongTensor(X)
        labels = torch.LongTensor(Y)

        return {
            "input_ids": input_ids,
            "labels": labels,
        }
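
A likely explanation: Hugging Face *ForCausalLM heads shift the logits/labels internally when computing the loss, so a dataset is expected to hand over labels aligned with input_ids (with prompt and padding positions masked to -100, as SFTDataset does). A dataset that pre-shifts, like PTMDatasetMap above, only makes sense if the model's forward does not shift again. Roughly, the transformers-style loss looks like this (a paraphrase of the standard pattern, not this project's code):

import torch
import torch.nn.functional as F

def causal_lm_loss(logits, labels, ignore_index=-100):
    # Predict token t+1 from position t: drop the last logit and the first label.
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=ignore_index,
    )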

DPO Training

Hi, was the publicly deployed 92M model trained with DPO?
When I tested it, the results were quite good, much better than the model I trained myself with baby_llama_chinese2.
Did you run into garbled or repeated answers when testing the model?
Although the released datasets include the RL data, the corresponding code has not been published. Was that method not used, or has the code simply not been released yet?
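
For reference, the DPO objective mentioned in the README trains the policy to prefer chosen over rejected responses relative to a frozen reference model. Below is a generic sketch of the standard DPO loss, not the project's (unreleased) training code; each argument is assumed to be the per-sample sum of log-probabilities over the response tokens.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: how much the policy diverges from the reference on each response.
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    # Standard DPO objective: -log sigmoid(beta * reward margin).
    return -F.logsigmoid(beta * (chosen_rewards - rejected_rewards)).mean()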

Transformers does not recognize the model type: ValueError: The checkpoint you are trying to load has model type `tinyllm` but Transformers does not recognize this architecture.

Following suggestions found online, I tried changing the transformers version number and updating the corresponding version in the model's config, but that did not solve it:
{
  "architectures": [
    "TinyllmForCausalLM"
  ],
  "attention_dropout": 0.0,
  "hidden_act": "silu",
  "hidden_size": 512,
  "initializer_range": 0.02,
  "intermediate_size": 1408,
  "max_position_embeddings": 1024,
  "model_type": "tinyllm",
  "num_attention_heads": 8,
  "num_hidden_layers": 8,
  "num_key_value_heads": 8,
  "rms_norm_eps": 1e-06,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.42.3",
  "use_cache": true,
  "vocab_size": 64798
}

[rank0]: Traceback (most recent call last):
[rank0]:   File "sft_train.py", line 193, in <module>
[rank0]:     main()
[rank0]:   File "sft_train.py", line 149, in main
[rank0]:     config = transformers.AutoConfig.from_pretrained(
[rank0]:   File "/home/daichenrui2404/miniconda3/envs/TINY_LLM_ZH/lib/python3.8/site-packages/transformers/models/auto/configuration_auto.py", line 984, in from_pretrained
[rank0]:     raise ValueError(
[rank0]: ValueError: The checkpoint you are trying to load has model type `tinyllm` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
E0708 15:05:53.145956 139943936660672 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 1131943) of binary: /home/daichenrui2404/miniconda3/envs/TINY_LLM_ZH/bin/python
Traceback (most recent call last):
  File "/home/daichenrui2404/miniconda3/envs/TINY_LLM_ZH/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/daichenrui2404/miniconda3/envs/TINY_LLM_ZH/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/daichenrui2404/miniconda3/envs/TINY_LLM_ZH/lib/python3.8/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/daichenrui2404/miniconda3/envs/TINY_LLM_ZH/lib/python3.8/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/daichenrui2404/miniconda3/envs/TINY_LLM_ZH/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/daichenrui2404/miniconda3/envs/TINY_LLM_ZH/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

sft_train.py FAILED
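
For what it's worth, custom model types such as tinyllm are usually loaded either by passing trust_remote_code=True (so the modeling code shipped with the checkpoint is used) or by registering the custom classes with the Auto* factories before loading. A hedged sketch; the import path in option 2 is hypothetical and depends on where the project defines TinyllmConfig / TinyllmForCausalLM.

from transformers import AutoConfig, AutoModelForCausalLM

model_dir = "outputs/tiny_llm_sft_92m"  # placeholder path

# Option 1: let transformers load the modeling code stored alongside the checkpoint.
config = AutoConfig.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True)

# Option 2: register the custom classes first, then plain Auto* loading
# recognizes model_type "tinyllm" (import path is illustrative).
# from modeling_tinyllm import TinyllmConfig, TinyllmForCausalLM
# AutoConfig.register("tinyllm", TinyllmConfig)
# AutoModelForCausalLM.register(TinyllmConfig, TinyllmForCausalLM)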
