
mini_llm's Introduction

Mini-llm

Created by Lil2J

📝Introduction

This project is my personal hands-on reproduction of a small-parameter-count Chinese LLM.

It mainly draws on the following open-source projects:

1.https://github.com/charent/Phi2-mini-Chinese

2.https://github.com/DLLXW/baby-llama2-chinese

3.https://github.com/charent/ChatLM-mini-Chinese

It covers the complete pipeline: pre-training, SFT instruction fine-tuning, DPO, and PPO (to do).

I hope this is useful to others, and contributions to improve it are welcome!

📚Project Overview

  • Train a 1.4B-parameter pre-trained model; the base architecture is Qwen, trained on roughly 8B tokens
  • Build an LLM code repository covering the complete pre-training, SFT instruction fine-tuning and DPO pipeline, including DeepSpeed distributed training

🌟Quick Start

# 1. Download the Wikipedia and BaiduBaiKe pre-training corpora and the alpaca data from the "Baby-llama2-chinese Corpus" Baidu Netdisk share.
#    Download data from https://huggingface.co/datasets/Skywork/SkyPile-150B/tree/main
#    Download train_2M_CN, train_1M_CN and train_0.5M_CN from https://huggingface.co/BelleGroup
#    Because of limited compute, I only downloaded the first 20 data files
#    After tokenizing all the data, the token count is roughly 8B
# 2. Put the downloaded data in a directory of your choice
# 3. Switch to the dataset_utils directory and run generate_data.py; before running, edit the file and uncomment the data-processing functions you need, otherwise nothing will be processed
# 4. Run generate_data.py to generate parquet files under the ./datasets/ directory
cd dataset_utils
python3 generate_data.py
#5. Edit train.sh: for a single GPU, remove --multi_gpu and pass accelerate_one_gpu.yaml to --config_file; for multiple GPUs, change num_processes: 4 in accelerate_multi_gpu.yaml
#   to your own number of GPUs

# Start pre-training
sh train.sh pre_train.py

# Multi-node, multi-GPU training
# Use accelerate_multi_gpus_on_multi_nodes.yaml, where:
# the DeepSpeed standard launch mode is used; num_machines is the number of nodes and num_processes is the total number of available GPUs
# Multi-node, multi-GPU training requires the following:
#1. Passwordless SSH between all nodes with the same login user on each node, and each node's access username written into every node's hosts file
#2. Identical environments across nodes, mainly the CUDA, NCCL and PyTorch versions, which also have to be mutually compatible
#3. Run on every node: accelerate launch --config_file accelerate_multi_gpus_on_multi_nodes.yaml --machine_rank {rank} --main_process_ip {MasterIP} --main_process_port {MasterPort} pre_train.py
#   where rank is a user-defined machine index (0 for the master node), MasterIP is the master node's IP and MasterPort its port; when launching, each node only needs to change rank
accelerate launch --config_file accelerate_multi_gpus_on_multi_nodes.yaml --machine_rank {rank} --main_process_ip {MasterIP} --main_process_port {MasterPort} pre_train.py 
  
#6. After pre-training, update the model-weight load path in sft.py
# Start SFT fine-tuning
sh train.sh sft.py

#7. Update the weight path in test.py and you can run inference
python3 test.py

🤖Pre-training

  1. Base model: the model is built on Qwen. I chose it because: 1. it is a mature open-source Chinese LLM project; 2. I didn't feel like building my own tokenizer, and Qwen's tokenizer has a good compression ratio, so I used it directly; and since I was already taking the tokenizer, I used its model architecture as well.

  2. Corpus for pre-training: the following classic datasets were used for this pre-training run:

    Chinese Wikipedia: wikipedia-cn-20230720-filtered

    BaiduBaiKe: Baidu Netdisk, extraction code: bwvb

    Skywork (SkyPile) dataset: https://huggingface.co/datasets/Skywork/SkyPile-150B/tree/main/data

Pre-training corpus preprocessing

Preprocessing follows Qwen's common practice: an end-of-text marker <|im_end|> is appended to each document to separate it from the next one. If a document exceeds the maximum length, it is truncated and the truncated remainder becomes the next sample.
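Below is a minimal sketch of this chunking scheme; the function name and the max_len value are illustrative assumptions, not the repo's actual generate_data.py code.

# Illustrative sketch of the preprocessing described above; see generate_data.py for the real code.
EOS = "<|im_end|>"

def split_corpus_to_chunks(texts, max_len=320):   # max_len here is an assumed value
    """Append the end marker to every document, then cut the stream into
    fixed-length samples; the truncated remainder starts the next sample."""
    chunks, buffer = [], ""
    for text in texts:
        buffer += text + EOS              # end marker separates documents
        while len(buffer) >= max_len:     # overflow carries over into the next sample
            chunks.append(buffer[:max_len])
            buffer = buffer[max_len:]
    if buffer:
        chunks.append(buffer)             # keep the shorter tail as a final sample
    return chunks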

💡SFT Instruction Fine-tuning

LLM fine-tuning is a means of eliciting the knowledge already present in the pre-trained model; put plainly, it teaches the model to speak like a person.

  1. Fine-tuning method: NLP currently follows a dominant paradigm: large-scale pre-training on general-domain data, followed by adaptation to a specific task or domain. To make a pre-trained model perform well on a specific task or domain, it therefore needs to be fine-tuned.

    LLM fine-tuning methods

  2. SFT fine-tuning data: adapting LLMs to vertical domains was the dominant theme of 2023, so SFT corpora and fine-tuned models for every domain keep appearing. Others have already curated, and keep updating, collections of the latest progress in this area; visit them if you need to.

    This project fine-tunes the model on two kinds of SFT corpora, as follows:

    Daily Q&A SFT data

    SFT corpus    Description
    alpaca-zh: alpaca-zh    Self-instruct data generated from GPT-4 following the Alpaca approach, about 50k samples.
    bell: bell    A subset of BelleGroup's SFT data, containing about 3 million Chinese instruction samples generated by the BELLE project.

SFT sample construction

Because SFT corpora are generally small, there is no need to tokenize them in advance; instead, tokenization and batch construction happen while building the DataLoader, and the batches are fed to the model. See sft.py for the details.
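A minimal sketch of that idea, tokenizing inside the collate function of the DataLoader; the tokenizer path and the sample field names are assumptions, and sft.py remains the authoritative implementation.

# Sketch only: tokenize SFT samples per batch instead of ahead of time.
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("model_save/pre", trust_remote_code=True)   # assumed path
PAD_ID = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else 0          # fallback pad id

def collate_fn(batch):
    # batch: list of dicts with "instruction"/"input"/"output" fields (assumed schema)
    texts = [b["instruction"] + b["input"] + b["output"] + "<|im_end|>" for b in batch]
    ids = [torch.tensor(tokenizer(t, max_length=512, truncation=True)["input_ids"]) for t in texts]
    input_ids = pad_sequence(ids, batch_first=True, padding_value=PAD_ID)
    labels = pad_sequence(ids, batch_first=True, padding_value=-100)      # padding ignored by the loss
    attention_mask = pad_sequence([torch.ones_like(i) for i in ids], batch_first=True, padding_value=0)
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

# loader = DataLoader(sft_samples, batch_size=8, shuffle=True, collate_fn=collate_fn)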

🤖DPO Stage

DPO stands for Direct Preference Optimization. The DPO code draws on the 0.2B-parameter Chinese chat model ChatLM-Chinese-0.2B.

Dataset construction

The construction has four steps:

  • Step 1: for instructions the SFT stage does not handle well, build an instruction dataset to optimize (you can also simply reuse the alpaca and bell SFT corpora mentioned above). The data structure is:
  {
    "instruction": "描述原子的结构。",
    "input": "",
    "output": "原子是物质的基本单位,它由三种基本粒子组成:质子、中子和电子。质子和中子形成原子核,位于原子中心,核外的电子围绕着原子核运动。\n\n原子结构具有层次性。原子核中,质子带正电,中子不带电(中性)。原子核非常小且致密,占据了原子总质量的绝大部分。电子带负电,通常围绕核运动,形成若干层次,称为壳层或电子层。电子数量与质子数量相等,使原子呈电中性。\n\n电子在每个壳层中都呈规律分布,并且不同壳层所能容纳的电子数也不同。在最里面的壳层一般只能容纳2个电子,其次一层最多可容纳8个电子,再往外的壳层可容纳的电子数逐层递增。\n\n原子核主要受到两种相互作用力的影响:强力和电磁力。强力的作用范围非常小,主要限制在原子核内,具有极强的吸引作用,使核子(质子和中子)紧密结合在一起。电磁力的作用范围较大,主要通过核外的电子与原子核相互作用,发挥作用。\n\n这就是原子的基本结构。原子内部结构复杂多样,不同元素的原子核中质子、中子数量不同,核外电子排布分布也不同,形成了丰富多彩的化学世界。"
  }
  • Step 2: merge the input fields to produce the target chosen dataset. The data structure is:
    {
        "prompt": "描述原子的结构。",
        "chosen": "原子是物质的基本单位,它由三种基本粒子组成:质子、中子和电子。质子和中子形成原子核,位于原子中心,核外的电子围绕着原子核运动。\n\n原子结构具有层次性。原子核中,质子带正电,中子不带电(中性)。原子核非常小且致密,占据了原子总质量的绝大部分。电子带负电,通常围绕核运动,形成若干层次,称为壳层或电子层。电子数量与质子数量相等,使原子呈电中性。\n\n电子在每个壳层中都呈规律分布,并且不同壳层所能容纳的电子数也不同。在最里面的壳层一般只能容纳2个电子,其次一层最多可容纳8个电子,再往外的壳层可容纳的电子数逐层递增。\n\n原子核主要受到两种相互作用力的影响:强力和电磁力。强力的作用范围非常小,主要限制在原子核内,具有极强的吸引作用,使核子(质子和中子)紧密结合在一起。电磁力的作用范围较大,主要通过核外的电子与原子核相互作用,发挥作用。\n\n这就是原子的基本结构。原子内部结构复杂多样,不同元素的原子核中质子、中子数量不同,核外电子排布分布也不同,形成了丰富多彩的化学世界。"
    },
  • Step 3: feed the prompt from step 2 (here, "描述原子的结构。") to the SFT model to get a response such as "一个原子由质子、中子和电子组成,它们以特定的方式排列成一个原子核。", which forms the rejected dataset. The data structure is:
{
    'prompt': '描述原子的结构。', 
   'reject': '一个原子由质子、中子和电子组成,它们以特定的方式排列成一个原子核。'
}
  • Step 4: merge the results of steps 2 and 3 (a code sketch covering steps 3 and 4 follows this list). The data structure is:
  {
        "prompt": "描述原子的结构。",
        "chosen": "原子是物质的基本单位,它由三种基本粒子组成:质子、中子和电子。质子和中子形成原子核,位于原子中心,核外的电子围绕着原子核运动。\n\n原子结构具有层次性。原子核中,质子带正电,中子不带电(中性)。原子核非常小且致密,占据了原子总质量的绝大部分。电子带负电,通常围绕核运动,形成若干层次,称为壳层或电子层。电子数量与质子数量相等,使原子呈电中性。\n\n电子在每个壳层中都呈规律分布,并且不同壳层所能容纳的电子数也不同。在最里面的壳层一般只能容纳2个电子,其次一层最多可容纳8个电子,再往外的壳层可容纳的电子数逐层递增。\n\n原子核主要受到两种相互作用力的影响:强力和电磁力。强力的作用范围非常小,主要限制在原子核内,具有极强的吸引作用,使核子(质子和中子)紧密结合在一起。电磁力的作用范围较大,主要通过核外的电子与原子核相互作用,发挥作用。\n\n这就是原子的基本结构。原子内部结构复杂多样,不同元素的原子核中质子、中子数量不同,核外电子排布分布也不同,形成了丰富多彩的化学世界。",
        "reject": "一个原子由质子、中子和电子组成,它们以特定的方式排列成一个原子核。"
    },
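A hedged sketch covering steps 3 and 4: sample the SFT model on each prompt to build the reject field and merge it with the chosen data. The checkpoint path, file names and sampling settings below are assumptions, not the repo's actual script.

# Sketch: build the rejected responses with the SFT model, then merge with the chosen data.
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

sft_dir = "model_save/checkpoint_sftmodel"      # assumed SFT checkpoint path
tokenizer = AutoTokenizer.from_pretrained(sft_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(sft_dir, trust_remote_code=True).eval().cuda()

def make_reject(prompt, max_new_tokens=256):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True, top_p=0.9)
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# chosen_data: list of {"prompt": ..., "chosen": ...} produced in step 2 (file name is assumed)
chosen_data = json.load(open("datasets/my_dpo_chosen.json", encoding="utf-8"))
dpo_data = [{"prompt": d["prompt"], "chosen": d["chosen"], "reject": make_reject(d["prompt"])}
            for d in chosen_data]
json.dump(dpo_data, open("datasets/my_dpo_train.json", "w", encoding="utf-8"), ensure_ascii=False)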

DPO training

  • Step 1: in dpo_train, edit the DpoConfig class to point to your SFT model path and DPO dataset paths
class DpoConfig:
    max_seq_len: int = 1024 + 8                  # 8 for eos token
    sft_model_file: str = '/MINI_LLM/model_save/checkpoint_sftmodel' # path to the SFT model
    tokenizer_dir: str = '/MINI_LLM/model_save/checkpoint_sftmodel'   # the tokenizer usually sits in the same folder as the model weights

    dpo_train_file: str = r'/MINILLM\MINI_LLM/datasets/my_dpo_train.json' # DPO training set
    dpo_eval_file: str = r'/MINILLM\MINI_LLM/datasets/my_dpo_eval.json' # DPO evaluation set

    adapter_file: str = '/MINILLM\MINI_LLM//dpo/adapter_model.safetensors'
    log_dir: str = '/MINILLM\MINI_LLM/logs'

    ...

    output_dir: str = '/MINILLM\MINI_LLM//dpo'  # DPO model output path
    ...
  • Step 2: run dpo_train; a sketch of what this training step can look like follows.
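As a reference for step 2, here is a hedged sketch of DPO training built on the DpoConfig above, using the trl library's DPOTrainer (trl 0.7.x-style API). Whether dpo_train uses exactly this API is an assumption; treat it as an illustration rather than the repo's code.

# Hedged sketch of DPO training with trl's DPOTrainer; dpo_train may differ in details.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

cfg = DpoConfig()
tokenizer = AutoTokenizer.from_pretrained(cfg.tokenizer_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(cfg.sft_model_file, trust_remote_code=True)
ref_model = AutoModelForCausalLM.from_pretrained(cfg.sft_model_file, trust_remote_code=True)  # frozen reference

data = load_dataset("json", data_files={"train": cfg.dpo_train_file, "eval": cfg.dpo_eval_file})
data = data.rename_column("reject", "rejected")          # trl expects the column name "rejected"

trainer = DPOTrainer(
    model,
    ref_model,
    beta=0.1,                                            # assumed DPO temperature
    args=TrainingArguments(output_dir=cfg.output_dir, per_device_train_batch_size=1,
                           num_train_epochs=1, bf16=True, logging_steps=20),
    train_dataset=data["train"],
    eval_dataset=data["eval"],
    tokenizer=tokenizer,
    max_length=cfg.max_seq_len,
    max_prompt_length=512,
)
trainer.train()
trainer.save_model(cfg.output_dir)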

Model comparison

You need the SFT model alongside to see the difference, since DPO essentially just aligns the SFT model more closely with your preference data, min(Π, Π*). The corresponding DPO data and the SFT model to be optimized can be downloaded here: https://pan.baidu.com/s/1GYeR6qrUhjsmpgh8-ABDpQ extraction code: dba9

🥇Model Weights and Evaluation

Weight downloads

Pre-trained weights: https://huggingface.co/Lil2J/mini_llm/tree/main

SFT model weights: https://huggingface.co/Lil2J/mini_llm_sft/tree/main

DPO model weights: https://huggingface.co/wtxfrancise/mini_llm_dpo/tree/main

  1. Pre-trained model

I first trained on Chinese Wikipedia + BaiduBaiKe. [wiki+baidu.png] Pre-training corpus: Chinese Wikipedia + BaiduBaiKe

Then I continued on the Skywork data. [sky.png] Pre-training corpus: the first 20 files of the SkyPile dataset

  2. SFT model

[sft.png] Fine-tuning corpus: alpaca data + bell: train_2M_CN, train_1M_CN and train_0.5M_CN

  3. SFT model results (a minimal inference sketch follows this list)
# Inference with the SFT fine-tuned model: test.py
python3 test.py

[SFT inference example screenshots]

  4. DPO model [screenshot] DPO corpus: alpaca data + bell: train_1M_CN
  5. DPO model results [screenshots]
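
For completeness, a minimal inference sketch along the lines of test.py; it works with either the SFT or DPO checkpoint by changing the path. The path, prompt and sampling settings here are assumptions, and test.py is the authoritative script.

# Hedged inference sketch; see test.py for the actual script.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "model_save/checkpoint_sftmodel"   # assumed checkpoint path (SFT or DPO)
tokenizer = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(ckpt, trust_remote_code=True,
                                             torch_dtype=torch.bfloat16).eval().cuda()

prompt = "请介绍一下北京。"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=256, do_sample=True, top_p=0.9, temperature=0.7)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))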

Other

If you have questions or want to work on LLMs together, add me on WeChat: ForeverM1LAn

mini_llm's People

Contributors

elegantlin, jiahe7ay, keesh0410, wtxfrancise


mini_llm's Issues

requirements.txt installation error

I get an error when running pip install -r requirements.txt; has anyone run into the same thing? The Python version is 3.10.
Collecting datasets (from -r requirements.txt (line 1))
Using cached datasets-2.19.1-py3-none-any.whl.metadata (19 kB)
Collecting transformers==4.36.0 (from -r requirements.txt (line 2))
Using cached transformers-4.36.0-py3-none-any.whl.metadata (126 kB)
Collecting torch==2.2.0 (from -r requirements.txt (line 3))
Using cached torch-2.2.0-cp310-cp310-manylinux1_x86_64.whl.metadata (25 kB)
Collecting accelerate==0.27.2 (from -r requirements.txt (line 4))
Using cached accelerate-0.27.2-py3-none-any.whl.metadata (18 kB)
Collecting einops==0.7.0 (from -r requirements.txt (line 5))
Using cached einops-0.7.0-py3-none-any.whl.metadata (13 kB)
Collecting flash-attn==2.5.5 (from -r requirements.txt (line 6))
Using cached flash_attn-2.5.5.tar.gz (2.5 MB)
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [6 lines of output]
Traceback (most recent call last):
File "", line 2, in
File "", line 34, in
File "/tmp/pip-install-h9ixb2zi/flash-attn_c3250dbfc30143cdbc07d297ff096fbe/setup.py", line 9, in
from packaging.version import parse, Version
ModuleNotFoundError: No module named 'packaging'
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

Data truncation question

The data is already truncated and gets <|im_end|> added when it goes through the Qwen tokenizer, so why truncate and add <|im_end|> again during data preprocessing?

Checkpoint save error during multi-GPU pre-training on 4090s

When I run sh train.sh pre_train.py with 8 GPUs, I get the following:
Saving model checkpoint to ./model_save/pre/tmp-checkpoint-50
Configuration saved in ./model_save/pre/tmp-checkpoint-50/config.json
Configuration saved in ./model_save/pre/tmp-checkpoint-50/generation_config.json
Traceback (most recent call last):
  File "/home/ubuntu/XZT/LLM/LLM_Project/MINI_LLM-main/pre_train.py", line 281, in <module>
    trainer.train( #'model_save/pre/checkpoint-3400'
  File "/home/ubuntu/anaconda3/envs/Fun_LLM/lib/python3.9/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/home/ubuntu/anaconda3/envs/Fun_LLM/lib/python3.9/site-packages/transformers/trainer.py", line 1914, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/home/ubuntu/anaconda3/envs/Fun_LLM/lib/python3.9/site-packages/transformers/trainer.py", line 2274, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/home/ubuntu/XZT/LLM/LLM_Project/MINI_LLM-main/pre_train.py", line 259, in _save_checkpoint
    super()._save_checkpoint(model, trial, metrics)
  File "/home/ubuntu/anaconda3/envs/Fun_LLM/lib/python3.9/site-packages/transformers/trainer.py", line 2383, in _save_checkpoint
    os.rename(staging_output_dir, output_dir)
FileNotFoundError: [Errno 2] No such file or directory: './model_save/pre/tmp-checkpoint-50' -> './model_save/pre/checkpoint-50'

[The same traceback is raised on several ranks, interleaved in the original output; one rank instead fails slightly earlier, inside save_model:]

  File "/home/ubuntu/anaconda3/envs/Fun_LLM/lib/python3.9/site-packages/transformers/trainer.py", line 2350, in _save_checkpoint
    self.save_model(staging_output_dir, _internal_call=True)
  File "/home/ubuntu/anaconda3/envs/Fun_LLM/lib/python3.9/site-packages/transformers/trainer.py", line 2837, in save_model
    self._save(output_dir)
  File "/home/ubuntu/anaconda3/envs/Fun_LLM/lib/python3.9/site-packages/transformers/trainer.py", line 2897, in _save
    self.model.save_pretrained(
  File "/home/ubuntu/anaconda3/envs/Fun_LLM/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2352, in save_pretrained
    for filename in os.listdir(save_directory):
FileNotFoundError: [Errno 2] No such file or directory: './model_save/pre/tmp-checkpoint-50'

Pre-training OOM on a 24GB 4090

The GPU is a 24GB 4090. I have already made the model and the batch size very small, but it still OOMs after three or four hundred samples. Any idea where the problem is?
torch version: 2.2.2
transformers version: 4.39.3
The modified model structure is as follows:
QWenLMHeadModel(
  (transformer): QWenModel(
    (wte): Embedding(151936, 128)
    (drop): Dropout(p=0.0, inplace=False)
    (rotary_emb): RotaryEmbedding()
    (h): ModuleList(
      (0-7): 8 x QWenBlock(
        (ln_1): RMSNorm()
        (attn): QWenAttention(
          (c_attn): Linear(in_features=128, out_features=384, bias=True)
          (c_proj): Linear(in_features=128, out_features=128, bias=False)
          (attn_dropout): Dropout(p=0.0, inplace=False)
        )
        (ln_2): RMSNorm()
        (mlp): QWenMLP(
          (w1): Linear(in_features=128, out_features=1024, bias=False)
          (w2): Linear(in_features=128, out_features=1024, bias=False)
          (c_proj): Linear(in_features=1024, out_features=128, bias=False)
        )
      )
    )
    (ln_f): RMSNorm()
  )
  (lm_head): Linear(in_features=128, out_features=151936, bias=False)
)

QWen size: 42.6M parameters

Training arguments:
args = TrainingArguments(
    output_dir=pretrain_args.model_save_dir,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=1,
    ddp_find_unused_parameters=False,
    gradient_checkpointing=True,
    num_train_epochs=1,
    weight_decay=0.1,
    warmup_steps=1000,
    learning_rate=5e-4,
    evaluation_strategy='steps',
    eval_steps=1000,
    save_steps=1000,
    save_strategy='steps',
    save_total_limit=3,
    report_to='tensorboard',
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    bf16=True,
    logging_steps=5,
    log_level='info',
    logging_first_step=True,
    eval_accumulation_steps=1,
    # group_by_length=True,
    # deepspeed='./ds_config_one_gpu.json',
)

A question about data preprocessing

lines.append(text + "<|im_end|>")
chunk_data = split_txt_corpus_to_chunk_en(lines)

If the previous sample's length happens to fall right around 2048, can the individual characters of "<|im_end|>" end up split across two different samples this way?

Data preprocessing question

In generate_data, the new version of the Skywork data processing does not append the standard end marker <|im_end|>. Was that an oversight or intentional?

Error during pre-training

Hey, I ran sh train.sh pre_train.py after switching to single-GPU mode; why does the error below appear? Searching around, it seems to mean "the model or data was not correctly moved to the corresponding device".
Number of trainable parameters = 1,431,996,416
0%| | 0/5195 [00:00<?, ?it/s]Traceback (most recent call last):
File "/data/hlh/MINI_LLM-main/pre_train.py", line 236, in
trainer.train(
File "/home/alex/miniconda3/envs/ChatGLM2-6b/lib/python3.10/site-packages/transformers/trainer.py", line 1645, in train
return inner_training_loop(
File "/home/alex/miniconda3/envs/ChatGLM2-6b/lib/python3.10/site-packages/transformers/trainer.py", line 2007, in _inner_training_loop
self.optimizer.step()
File "/home/alex/miniconda3/envs/ChatGLM2-6b/lib/python3.10/site-packages/accelerate/optimizer.py", line 145, in step
self.optimizer.step(closure)
File "/home/alex/miniconda3/envs/ChatGLM2-6b/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
return wrapped(*args, **kwargs)
File "/home/alex/miniconda3/envs/ChatGLM2-6b/lib/python3.10/site-packages/torch/optim/optimizer.py", line 373, in wrapper
out = func(*args, **kwargs)
File "/home/alex/miniconda3/envs/ChatGLM2-6b/lib/python3.10/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
ret = func(self, *args, **kwargs)
File "/home/alex/miniconda3/envs/ChatGLM2-6b/lib/python3.10/site-packages/torch/optim/adamw.py", line 184, in step
adamw(
File "/home/alex/miniconda3/envs/ChatGLM2-6b/lib/python3.10/site-packages/torch/optim/adamw.py", line 335, in adamw
func(
File "/home/alex/miniconda3/envs/ChatGLM2-6b/lib/python3.10/site-packages/torch/optim/adamw.py", line 509, in _multi_tensor_adamw
grouped_tensors = Optimizer._group_tensors_by_device_and_dtype([
File "/home/alex/miniconda3/envs/ChatGLM2-6b/lib/python3.10/site-packages/torch/optim/optimizer.py", line 397, in _group_tensors_by_device_and_dtype
return _group_tensors_by_device_and_dtype(tensorlistlist, with_indices)
File "/home/alex/miniconda3/envs/ChatGLM2-6b/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/alex/miniconda3/envs/ChatGLM2-6b/lib/python3.10/site-packages/torch/utils/_foreach_utils.py", line 42, in _group_tensors_by_device_and_dtype
torch._C._group_tensors_by_device_and_dtype(tensorlistlist, with_indices).items()
RuntimeError: Tensors of the same index must be on the same device and the same dtype except step tensors that can be CPU and float32 notwithstanding
0%| | 0/5195 [00:14<?, ?it/s]
Traceback (most recent call last):
File "/home/alex/miniconda3/envs/ChatGLM2-6b/bin/accelerate", line 8, in
sys.exit(main())
File "/home/alex/miniconda3/envs/ChatGLM2-6b/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/home/alex/miniconda3/envs/ChatGLM2-6b/lib/python3.10/site-packages/accelerate/commands/launch.py", line 986, in launch_command
simple_launcher(args)
File "/home/alex/miniconda3/envs/ChatGLM2-6b/lib/python3.10/site-packages/accelerate/commands/launch.py", line 628, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/alex/miniconda3/envs/ChatGLM2-6b/bin/python', 'pre_train.py']' returned non-zero exit status 1.

Pre-training initial model

Thanks for open-sourcing this!
The README says pre-training is initialized from the Qwen weights? Does the config support reusing only the architecture and training from scratch?

OutOfMemoryError at a specific step during pre-training

After multiple retries, pre-training always hits OutOfMemoryError at the same specific step
61%|██████ | 12840/21167 [14:00:18<8:40:59, 3.75s/it]
61%|██████ | 12841/21167 [14:00:21<8:33:50, 3.70s/it]
61%|██████ | 12842/21167 [14:00:25<8:27:47, 3.66s/it]
61%|██████ | 12843/21167 [14:00:28<8:25:30, 3.64s/it]
61%|██████ | 12844/21167 [14:00:32<8:25:06, 3.64s/it]
61%|██████ | 12845/21167 [14:00:36<8:24:13, 3.64s/it]
61%|██████ | 12846/21167 [14:00:39<8:23:55, 3.63s/it]
61%|██████ | 12847/21167 [14:00:43<8:20:11, 3.61s/it]
61%|██████ | 12848/21167 [14:00:46<8:14:37, 3.57s/it]
61%|██████ | 12849/21167 [14:00:50<8:12:50, 3.56s/it]
61%|██████ | 12850/21167 [14:00:53<8:13:45, 3.56s/it]Traceback (most recent call last):
File "/home/tiger/MINI_LLM/pre_train.py", line 262, in
trainer.train( #'model_save/pre/checkpoint-3400'
File "/home/tiger/.local/lib/python3.9/site-packages/transformers/trainer.py", line 1537, in train
return inner_training_loop(
File "/home/tiger/.local/lib/python3.9/site-packages/transformers/trainer.py", line 1854, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/tiger/.local/lib/python3.9/site-packages/transformers/trainer.py", line 2735, in training_step
loss = self.compute_loss(model, inputs)
File "/home/tiger/.local/lib/python3.9/site-packages/transformers/trainer.py", line 2758, in compute_loss
outputs = model(**inputs)
File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/torch/nn/parallel/distributed.py", line 1523, in forward
else self._run_ddp_forward(*inputs, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/torch/nn/parallel/distributed.py", line 1359, in _run_ddp_forward
return self.module(*inputs, **kwargs) # type: ignore[index]
File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/tiger/.local/lib/python3.9/site-packages/accelerate/utils/operations.py", line 817, in forward
return model_forward(*args, **kwargs)
File "/home/tiger/.local/lib/python3.9/site-packages/accelerate/utils/operations.py", line 805, in call
return convert_to_fp32(self.model_forward(*args, **kwargs))
File "/home/tiger/.local/lib/python3.9/site-packages/accelerate/utils/operations.py", line 784, in convert_to_fp32
return recursively_apply(_convert_to_fp32, tensor, test_type=_is_fp16_bf16_tensor)
File "/home/tiger/.local/lib/python3.9/site-packages/accelerate/utils/operations.py", line 127, in recursively_apply
{
File "/home/tiger/.local/lib/python3.9/site-packages/accelerate/utils/operations.py", line 128, in
k: recursively_apply(
File "/home/tiger/.local/lib/python3.9/site-packages/accelerate/utils/operations.py", line 135, in recursively_apply
return func(data, *args, **kwargs)
File "/home/tiger/.local/lib/python3.9/site-packages/accelerate/utils/operations.py", line 779, in _convert_to_fp32
return tensor.float()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 12.04 GiB. GPU 7 has a total capacity of 79.35 GiB of which 8.27 GiB is free. Process 2987497 has 71.07 GiB memory in use. Of the allocated memory 65.19 GiB is allocated by PyTorch, and 3.39 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Every sample is truncated to a max length of 512, batch_size is set to 16 and gradient_accumulation_steps to 8; GPU memory is sufficient when training first starts.

How should I handle CUDA out of memory on an 8x 3090 machine?

File "/home/xiezizhe/anaconda3090/envs/willm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
shift_logits = lm_logits[..., :-1, :].contiguous()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.22 GiB. GPU 0 has a total capacity of 23.69 GiB of which 1.98 GiB is free. Process 38499 has 21.70 GiB memory in use. Of the allocated memory 20.49 GiB is allocated by PyTorch, and 818.86 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Pre-training data question

I'd like to ask: the SkyPile dataset is very large, about 500GB in total. Could you explain which data the model was actually trained on and how many tokens in total?

Details of the pre-training data

  1. Is there an advantage to training on wiki and baidu first, i.e. putting the higher-quality data up front?
  2. When continuing pre-training from the previous checkpoint, do you train purely on the Skywork data, or on Skywork + wiki + baidu?
    Thanks 🙏

Different GPU memory usage across cards during SFT

Hi, the SFT code sets max_length=512 and truncation=True; why is the GPU memory usage still different across cards, and why does it sometimes go out of memory?

data_process.py file not found

+++++++++++++++++
···
3. Switch to the dataset_utils directory and run generate_data.py; before running, edit the file and uncomment the data-processing functions, otherwise nothing will run
4. Run data_process.py to generate parquet files under the ./datasets/ directory
cd dataset_utils
python data_process.py
···
+++++++++++++++++

I can't find this data_process.py; looking at the source, should it be generate_data.py?
