tencent / tencentpretrain

Tencent Pre-training framework in PyTorch & Pre-trained Model Zoo

Home Page: https://github.com/Tencent/TencentPretrain/wiki

License: Other

Python 100.00%
Topics: albert, bart, bert, chinese, classification, clue, elmo, fine-tuning, gpt, gpt-2

tencentpretrain's Introduction

English | 中文

TencentPretrain: Tencent Pre-training Framework

Pre-training has become an essential part of AI technology. TencentPretrain is a toolkit for pre-training and fine-tuning on data of different modalities (e.g. text and vision). TencentPretrain is characterized by its modular design, which facilitates the use of existing pre-training models and provides interfaces for users to extend them further. With TencentPretrain, we have built a model zoo that contains pre-trained models with different properties. TencentPretrain inherits the open source toolkit UER (https://github.com/dbiir/UER-py/) and extends it to a multimodal pre-training framework.


Features

TencentPretrain has the following features:

  • Reproducibility TencentPretrain has been tested on many datasets and should match the performance of the original pre-training model implementations such as BERT, GPT-2, ELMo, T5, and CLIP.
  • Model modularity TencentPretrain is divided into the following parts: embedding, encoder, target embedding (optional), decoder (optional), and target. Ample modules are implemented in each part. A clear and robust interface allows users to combine modules to construct pre-training models with as few restrictions as possible.
  • Multimodal TencentPretrain supports different modalities such as text, vision, and audio.
  • Model training TencentPretrain supports CPU mode, single GPU mode, distributed training mode, and gigantic model training with DeepSpeed.
  • Model zoo With the help of TencentPretrain, we pre-train and release models with different properties. Proper selection of pre-trained models is important for downstream task performance.
  • SOTA results TencentPretrain supports comprehensive downstream tasks (e.g. classification and machine reading comprehension) and provides the winning solutions of many competitions.
  • Abundant functions TencentPretrain provides abundant functionality related to pre-training, such as feature extraction and text generation.

Requirements

  • Python >= 3.6
  • torch >= 1.1
  • six >= 1.12.0
  • argparse
  • packaging
  • regex
  • For converting pre-trained models to and from TensorFlow format, you will need TensorFlow
  • For tokenization with a SentencePiece model, you will need SentencePiece
  • For developing a stacking model, you will need LightGBM and BayesianOptimization
  • For pre-training with whole word masking, you will need a word segmentation tool such as jieba
  • For using CRF in the sequence labeling downstream task, you will need pytorch-crf
  • For gigantic model training, you will need DeepSpeed
  • For vision model training, you will need torchvision
  • For audio model training, you will need torchaudio; opencv-python is needed for some special settings of SpecAugment, and editdistance is needed when fine-tuning a speech2text model
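
A typical setup installs the core dependencies from the repository's requirements.txt and adds the optional packages only for the features that need them. The following is a sketch, using the usual PyPI package names:

pip install -r requirements.txt            # six, packaging, regex, ...
pip install tensorflow                     # only for TensorFlow model conversion
pip install sentencepiece                  # only for SentencePiece tokenization
pip install jieba                          # only for whole word masking
pip install pytorch-crf                    # only for CRF sequence labeling
pip install deepspeed                      # only for gigantic model training
pip install torchvision torchaudio         # only for vision / audio models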

Quickstart

This section uses several commonly-used examples to demonstrate how to use TencentPretrain. More details are discussed in the Instructions section. We first use BERT (a text pre-training model) on a book review sentiment classification dataset: we pre-train the model on the book review corpus and then fine-tune it on the book review sentiment classification dataset. There are three input files: the book review corpus, the book review sentiment classification dataset, and the vocabulary. All files are encoded in UTF-8 and included in this project.

The format of the corpus for BERT is as follows (one sentence per line and documents are delimited by empty lines):

doc1-sent1
doc1-sent2
doc1-sent3

doc2-sent1

doc3-sent1
doc3-sent2

The book review corpus is obtained from the book review sentiment classification dataset: we remove the labels and split each review into two parts from the middle to construct a document with two sentences (see book_review_bert.txt in the corpora folder).
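
As an illustration of that construction (a sketch, not the project's actual preprocessing script), assuming a label\ttext_a TSV with a header row:

# Sketch: turn labeled reviews into the two-sentence document format described above.
with open("datasets/book_review/train.tsv", encoding="utf-8") as fin, \
     open("corpora/book_review_bert.txt", "w", encoding="utf-8") as fout:
    next(fin)                                            # skip the "label\ttext_a" header row
    for line in fin:
        _, text = line.rstrip("\n").split("\t", 1)       # drop the label column
        mid = len(text) // 2                             # split the review from the middle
        fout.write(text[:mid] + "\n" + text[mid:] + "\n\n")  # blank line delimits documents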

The format of the classification dataset is as follows:

label    text_a
1        instance1
0        instance2
1        instance3

The label and the instance are separated by \t (tab). The first row contains the column names. For n-way classification, the label should be an integer from 0 to n-1 inclusive.

We use Google's Chinese vocabulary file models/google_zh_vocab.txt, which contains 21128 Chinese characters.
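
The vocabulary file has one token per line, so a quick sanity check is to count its lines; this should report 21128 if the file matches the released vocabulary:

wc -l models/google_zh_vocab.txt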

We first pre-process the book review corpus. In the pre-processing stage, the corpus needs to be processed into the format required by the specified pre-training model (--data_processor):

python3 preprocess.py --corpus_path corpora/book_review_bert.txt --vocab_path models/google_zh_vocab.txt \
                      --dataset_path dataset.pt --processes_num 8 --data_processor bert

Notice that six>=1.12.0 is required.

Pre-processing is time-consuming. Using multiple processes (--processes_num) can largely accelerate it. The BERT tokenizer is used by default (--tokenizer bert). After pre-processing, the raw text is converted to dataset.pt, which is the input of pretrain.py. Then we download Google's pre-trained Chinese BERT model google_zh_model.bin (in TencentPretrain format; the original model is from here) and put it in the models folder. We load the pre-trained Chinese BERT model and further pre-train it on the book review corpus. A pre-training model is usually composed of embedding, encoder, and target layers. To build a pre-training model, we need to provide the related information. The configuration file (--config_path) specifies the modules and hyper-parameters used by the pre-training model. More details can be found in models/bert/base_config.json. Suppose we have a machine with 8 GPUs:

python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
                    --pretrained_model_path models/google_zh_model.bin \
                    --config_path models/bert/base_config.json \
                    --output_model_path models/book_review_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 5000 --save_checkpoint_steps 1000 --batch_size 32

mv models/book_review_model.bin-5000 models/book_review_model.bin

Notice that the model saved by pretrain.py carries a suffix recording the training step (--total_steps). We can remove the suffix for ease of use.

Then we fine-tune the pre-trained model on the downstream classification dataset. We use the embedding and encoder layers of book_review_model.bin, which is the output of pretrain.py:

python3 finetune/run_classifier.py --pretrained_model_path models/book_review_model.bin \
                                   --vocab_path models/google_zh_vocab.txt \
                                   --config_path models/bert/base_config.json \
                                   --train_path datasets/book_review/train.tsv \
                                   --dev_path datasets/book_review/dev.tsv \
                                   --test_path datasets/book_review/test.tsv \
                                   --epochs_num 3 --batch_size 32

The fine-tuned classifier model is saved to models/finetuned_model.bin by default. Note that the actual batch size of pre-training is --batch_size times --world_size (for example, 32 × 8 = 256 with the command above), while the actual batch size of a downstream task (e.g. classification) is simply --batch_size. Then we do inference with the fine-tuned model.

python3 inference/run_classifier_infer.py --load_model_path models/finetuned_model.bin \
                                          --vocab_path models/google_zh_vocab.txt \
                                          --config_path models/bert/base_config.json \
                                          --test_path datasets/book_review/test_nolabel.tsv \
                                          --prediction_path datasets/book_review/prediction.tsv \
                                          --labels_num 2

--test_path specifies the path of the file to be predicted; the file should contain a text_a column. --prediction_path specifies the path of the file with prediction results. We need to explicitly specify the number of labels with --labels_num; the dataset above is a two-way classification dataset.
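
For reference, the prediction input file has the same layout as the classification dataset but without the label column (the instances below are placeholders):

text_a
instance1
instance2
instance3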


The above content covers the basic ways of using TencentPretrain to pre-process, pre-train, fine-tune, and do inference. More use cases can be found in the complete ➡️ quickstart ⬅️ , which contains abundant examples covering most pre-training related application scenarios. It is recommended that users read the complete quickstart in order to make full use of the project.


Pre-training data

This section provides links to a range of ➡️ pre-training data ⬅️ . TencentPretrain can load these pre-training data directly.


Downstream datasets

This section provides links to a range of ➡️ downstream datasets ⬅️ . TencentPretrain can load these datasets directly.


Modelzoo

With the help of TencentPretrain, we have pre-trained models with different properties (e.g. models based on different modalities, encoders, and targets). A detailed introduction of the pre-trained models and their download links can be found in ➡️ modelzoo ⬅️ . All pre-trained models can be loaded by TencentPretrain directly.


Instructions

TencentPretrain is organized as follows:

TencentPretrain/
    |--tencentpretrain/
    |    |--embeddings/ # contains modules of embedding component
    |    |--encoders/ # contains modules of encoder component such as RNN, CNN, Transformer
    |    |--decoders/ # contains modules of decoder component
    |    |--targets/ # contains modules of target component such as language modeling, masked language modeling
    |    |--layers/ # contains frequently-used NN layers
    |    |--models/ # contains model.py, which combines modules of different components
    |    |--utils/ # contains frequently-used utilities
    |    |--model_builder.py
    |    |--model_loader.py
    |    |--model_saver.py
    |    |--opts.py
    |    |--trainer.py
    |
    |--corpora/ # contains pre-training data
    |--datasets/ # contains downstream tasks
    |--models/ # contains pre-trained models, vocabularies, and configuration files
    |--scripts/ # contains useful scripts for pre-training models
    |--finetune/ # contains fine-tuning scripts for downstream tasks
    |--inference/ # contains inference scripts for downstream tasks
    |
    |--preprocess.py
    |--pretrain.py
    |--README.md
    |--README_ZH.md
    |--requirements.txt
    |--LICENSE

The code is organized by components (e.g. embeddings, encoders). Users can use and extend it with little effort.
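
As a concrete illustration, a configuration file picks one module from each component. The sketch below only illustrates the idea; the exact field names may differ from the shipped models/bert/base_config.json:

{
    "emb_size": 768,
    "feedforward_size": 3072,
    "hidden_size": 768,
    "heads_num": 12,
    "layers_num": 12,
    "dropout": 0.1,
    "embedding": ["word", "pos", "seg"],
    "encoder": "transformer",
    "mask": "fully_visible",
    "target": ["mlm", "sp"]
}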

Comprehensive examples of using TencentPretrain can be found in ➡️ instructions ⬅️ , which help users quickly implement pre-training models such as BERT, GPT-2, ELMo, T5, CLIP and fine-tune pre-trained models on a range of downstream tasks.


Competition solutions

TencentPretrain has been used in winning solutions of many competitions. In this section, we provide some examples of using TencentPretrain to achieve SOTA results on competitions, such as CLUE. See ➡️ competition solutions ⬅️ for more detailed information.


Citation

If you use the work in TencentPretrain (e.g. its pre-trained models) in academic research, please cite the system paper published at ACL 2023:

@article{zhao2023tencentpretrain,
  title={TencentPretrain: A Scalable and Flexible Toolkit for Pre-training Models of Different Modalities},
  author={Zhao, Zhe and Li, Yudong and Hou, Cheng and Zhao, Jing and others},
  journal={ACL 2023},
  pages={217},
  year={2023}
}

tencentpretrain's People

Contributors

eltociear, eric8932, fengyh3, hhou435, jingzijingzi, karots123, li-donglei, smilencelsy, winter523, wmpscc, ydli-ai, yuzhangogogo, zhezhaoa


tencentpretrain's Issues

Does the framework support multi-node multi-GPU training? How do I launch it?

The pretrain.py script has a master_ip argument. Using deepspeed pretrain.py --deepspeed --deepspeed_config models/deepspeed_zero$1_config.json --dataset_path dataset.pt --spm_model_path ../tokenizer.model --config_path models/llama/$2b_config.json --output_model_path models/output_model.bin --world_size 8 --learning_rate 1e-4 --data_processor lm --total_steps 10000 --save_checkpoint_steps 2000 --batch_size $3 --log_path log/$1-$2-$3.log --total_steps 50 --deepspeed_checkpoint_activations --master_ip tcp://172.0.67.6:12914, training can be started directly on the slave machine.


Found a few bugs: is the dynamic_masking logic inverted, with an extra not?

1. In utils/dataset.py, is the logic inverted everywhere self.dynamic_masking appears? For example:

if not self.dynamic_masking:
    src, tgt = mask_seq(src, self.tokenizer, self.whole_word_masking)
else:
    instance = ((src, pad_num), seg_pos)

Passing --dynamic_masking on the command line sets dynamic_masking to true, but mask_seq is only called when it is not true, so isn't the not redundant?

2. At line 32 of generate_seq2seq.py:
tmp_emb = str2embedding[embedding_name](args, len(args.tokenizer.vocab))
Shouldn't args.tokenizer be args.tgt_tokenizer here?

Pre-training LLaMA-7B on a single machine with 2 GPUs fails with TypeError: an integer is required (got type NoneType)

Training on a single machine with 2 GPUs reports the following error:

Traceback (most recent call last):
  File "/home/d00620160/local/project/TencentPretrain/pretrain.py", line 139, in <module>
    main()
  File "/home/d00620160/local/project/TencentPretrain/pretrain.py", line 135, in main
    trainer.train_and_validate(args)
  File "/home/d00620160/local/project/TencentPretrain/tencentpretrain/trainer.py", line 147, in train_and_validate
    worker(args.local_rank, None, args)
  File "/home/d00620160/local/project/TencentPretrain/tencentpretrain/trainer.py", line 732, in worker
    trainer.train(args, local_rank, global_rank, train_loader, model_for_training, optimizer, scheduler)
  File "/home/d00620160/local/project/TencentPretrain/tencentpretrain/trainer.py", line 193, in train
    batch = list(next(loader_iter))
  File "/home/d00620160/local/project/TencentPretrain/tencentpretrain/utils/dataloader.py", line 187, in __iter__
    yield torch.LongTensor(src),
TypeError: an integer is required (got type NoneType)

The training command is as follows:

CUDA_VISIBLE_DEVICES=6,7 deepspeed pretrain.py --deepspeed --deepspeed_config models/deepspeed_zero3_config.json --enable_zero3 --pretrained_model_path models/llama2-7b.bin --dataset_path llama_support.pt --spm_model_path models/llama/tokenizer.model --config_path models/llama/7b_config.json --output_model_path models/llama_support_7b_dpw.bin --world_size 2 --gpu_ranks 0 1 --data_processor lm --deepspeed_checkpoint_activations --total_steps 300000 --save_checkpoint_steps 5000 --batch_size 1

Does this error mean there is a problem with the data, or with how the model is loaded?

finetune llama but exits with return code = -9

hey thanks for your great work.

I tried to fine-tune LLaMA on 8 V100 GPUs (32 GB each), but got an error. I ran this command:

deepspeed pretrain.py --deepspeed --deepspeed_config models/deepspeed_config.json --pretrained_model_path models/llama-7b.bin --dataset_path dataset.pt --spm_model_path ../llama-dl/tokenizer.model --config_path models/llama/7b_config.json --output_model_path models/output_model.bin --world_size 8 --learning_rate 1e-4 --data_processor lm --total_steps 10000 --save_checkpoint_steps 2000 --batch_size 24

Here is the log:

[2023-03-13 11:36:31,475] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-03-13 11:36:31,860] [INFO] [runner.py:550:main] cmd = /TencentPretrain/venv/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None pretrain.py --deepspeed --deepspeed_config models/deepspeed_config.json --pretrained_model_path models/llama-7b.bin --dataset_path dataset.pt --spm_model_path ../llama-dl/tokenizer.model --config_path models/llama/7b_config.json --output_model_path models/output_model.bin --world_size 8 --learning_rate 1e-4 --data_processor lm --total_steps 10000 --save_checkpoint_steps 2000 --batch_size 24
[2023-03-13 11:36:38,091] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2023-03-13 11:36:38,091] [INFO] [launch.py:149:main] nnodes=1, num_local_procs=8, node_rank=0
[2023-03-13 11:36:38,091] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2023-03-13 11:36:38,091] [INFO] [launch.py:162:main] dist_world_size=8
[2023-03-13 11:36:38,091] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2023-03-13 11:38:38,289] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 93915
[2023-03-13 11:38:39,948] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 93916
[2023-03-13 11:38:41,559] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 93917
[2023-03-13 11:38:43,171] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 93919
[2023-03-13 11:38:44,863] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 93920
[2023-03-13 11:38:46,514] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 93921
[2023-03-13 11:38:48,205] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 93922
[2023-03-13 11:38:49,895] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 93924
[2023-03-13 11:38:49,898] [ERROR] [launch.py:324:sigkill_handler] ['/TencentPretrain/venv/bin/python', '-u', 'pretrain.py', '--local_rank=7', '--deepspeed', '--deepspeed_config', 'models/deepspeed_config.json', '--pretrained_model_path', 'models/llama-7b.bin', '--dataset_path', 'dataset.pt', '--spm_model_path', '../llama-dl/tokenizer.model', '--config_path', 'models/llama/7b_config.json', '--output_model_path', 'models/output_model.bin', '--world_size', '8', '--learning_rate', '1e-4', '--data_processor', 'lm', '--total_steps', '10000', '--save_checkpoint_steps', '2000', '--batch_size', '24'] exits with return code = -9

The process is killed while loading the model weights:

model_for_training = load_model(model_for_training, args.pretrained_model_path)

Why does the error happen here, and how can I load the weights successfully?

size mismatch for classifier.weight: copying a param with shape torch.Size([7, 768]) from checkpoint, the shape in current model is torch.Size([2, 768]).

After fine-tuning a multi-label model and running convert_bert_text_classification_from_tencentpretrain_to_huggingface.py, model prediction fails with:

Traceback (most recent call last):
  File "/home/almalinux/TencentPretrain/demo.py", line 7, in <module>
    model = AutoModelForSequenceClassification.from_pretrained(model_path)
  File "/home/almalinux/miniconda3/envs/bert_env/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 493, in from_pretrained
    return model_class.from_pretrained(
  File "/home/almalinux/miniconda3/envs/bert_env/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2903, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/home/almalinux/miniconda3/envs/bert_env/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3310, in _load_pretrained_model
    raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for BertForSequenceClassification:
size mismatch for classifier.weight: copying a param with shape torch.Size([7, 768]) from checkpoint, the shape in current model is torch.Size([2, 768]).
size mismatch for classifier.bias: copying a param with shape torch.Size([7]) from checkpoint, the shape in current model is torch.Size([2]).
You may consider adding ignore_mismatched_sizes=True in the model from_pretrained method.

Is LoRA training for LLaMA not supported?

As in the title. The command is as follows:
python pretrain.py --pretrained_model_path models/llama-7b.bin --dataset_path datasets/ceshi --spm_model_path /u01/wangcheng/llm/llama/tokenizer.model --config_path models/llama/7b_config.json --output_model_path models/llama_zh_7b --world_size 5 --data_processor lm --total_steps 300000 --save_checkpoint_steps 5000 --batch_size 24 --use_lora --lora_dropout 0.05

It only runs up to "Using distributed mode for training." and then exits?

LLaMA2-70B format conversion

If I want to convert a LLaMA2-70B model from Hugging Face format to TencentPretrain format, how should I modify the LLaMA conversion script?

KeyError: 'd'

python finetune/run_classifier.py --pretrained_model_path models/roberta-base-finetuned-dianping-chinese/pytorch_model.bin \
                                  --vocab_path models/google_zh_vocab.txt \
                                  --config_path models/sbert/base_config.json \
                                  --output_model_path models/test_model.bin \
                                  --train_path datasets/book_review/train.tsv \
                                  --dev_path datasets/book_review/dev.tsv \
                                  --test_path datasets/book_review/test.tsv \
                                  --epochs_num 3 \
                                  --batch_size 32 \
                                  --learning_rate 3e-5 \
                                  --seq_length 512

Traceback (most recent call last):
  File "/home/almalinux/TencentPretrain/finetune/run_classifier.py", line 366, in <module>
    main()
  File "/home/almalinux/TencentPretrain/finetune/run_classifier.py", line 291, in main
    model = Classifier(args)
  File "/home/almalinux/TencentPretrain/finetune/run_classifier.py", line 33, in __init__
    tmp_emb = str2embedding[embedding_name](args, len(args.tokenizer.vocab))
KeyError: 'd'

Does LoRA not support DDP?

Can't pickle local object '__init__.<locals>.<lambda>'

When LoRA is used with DDP, the lambda function cannot be pickled.

Suggestion: add an if dropout > 0 check directly in the forward function instead of creating the dropout at initialization, thereby avoiding the lambda function.
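
A minimal sketch of that suggestion (class and attribute names here are illustrative, not the actual TencentPretrain code): store the dropout rate and branch in forward, so DDP never has to pickle a lambda.

import torch.nn as nn
import torch.nn.functional as F

class LoraLinear(nn.Module):                     # illustrative module, not the project's class
    def __init__(self, in_dim, out_dim, r=8, lora_dropout=0.0):
        super().__init__()
        self.lora_dropout_p = lora_dropout       # keep the rate instead of building a lambda
        self.lora_A = nn.Linear(in_dim, r, bias=False)
        self.lora_B = nn.Linear(r, out_dim, bias=False)

    def forward(self, x):
        # Apply dropout only when the rate is positive; no unpicklable closure is created.
        if self.lora_dropout_p > 0:
            x = F.dropout(x, p=self.lora_dropout_p, training=self.training)
        return self.lora_B(self.lora_A(x))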

About Training log

[Screenshot: training console output, 2023-03-14 3:55 PM]

Hi! During the process of fine-tuning Llama by following the tutorial, I only received warnings related to the scaling factor used in mixed precision training and did not receive any logs related to the training process itself. Is this normal?

The model is saved on every GPU

When I run 8-GPU pre-training, a model checkpoint is saved on every GPU at save time:

[Screenshot: saved checkpoint files]

Are the models on each GPU identical? Is there a bug in this code? Should the save only happen on rank == 0?

if args.deepspeed:
    if self.current_step % self.save_checkpoint_steps == 0:
        if args.use_lora:
            if rank == 0:
                save_model(model, self.output_model_path + "-" + str(self.current_step), args.use_lora)
        else:
            model.save_checkpoint(self.output_model_path, str(self.current_step))

Questions about expanding the vocabulary before continued pre-training

  1. Linly-Chinese-LLaMA-7B does not expand the vocabulary. If I train my own SentencePiece vocabulary and merge it with the original one, how should I modify the code to expand the embeddings for pre-training?
  2. Is continued pre-training of LLaMA 13B supported?

LoRA inference

After pre-training with LoRA, how do I run inference?

DeepSpeedZeRoOffload initialization failed (can't allocate memory)

Hi, I am basically following the instructions in this guide https://github.com/CVI-SZU/Linly/wiki/%E5%A2%9E%E9%87%8F%E8%AE%AD%E7%BB%83 to use TencentPretrain for pre-training, but it threw an error saying there is not enough CPU memory for DeepSpeed initialization.

Here is the command I ran:
(The data (pt file) in dataset_path is just a small one generated from 100 lines of text. I just wanna see if the training can get started properly in my environment.)

deepspeed pretrain.py --deepspeed --deepspeed_config models/deepspeed_zero3_config.json --enable_zero3 \
 --pretrained_model_path ${modeldir}/chinese_llama_7b.bin \
 --dataset_path ${datadir}/chinese_llama_small_test1.pt \
 --spm_model_path ${modeldir}/tokenizer.model \
 --config_path models/llama/7b_config.json \
 --output_model_path ${outdir}/chinese_llama_7b_pretrain_small_test \
 --world_size 1 --data_processor lm --deepspeed_checkpoint_activations \
 --total_steps 10000 --save_checkpoint_steps 5000 --batch_size 24

Here is the config for zero_optimization (I didn't change any configuration in models/deepspeed_zero3_config.json):

  "zero_optimization": {
      "stage": 3,
      "offload_param": {
          "device": "cpu",
          "pin_memory": true
      },
      "offload_optimizer": {
          "device": "cpu",
          "pin_memory":true
      }
  },

Here is the full log.

[2023-05-24 10:00:03,687] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-05-24 10:00:03,760] [INFO] [runner.py:550:main] cmd = /opt/conda/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None pretrain.py --deepspeed --deepspeed_config models/deepspeed_zero3_config.json --enable_zero3 --pretrained_model_path /workspace/TencentPretrain/../models/llama_chinese/chinese_llama_7b.bin --dataset_path /workspace/TencentPretrain/../datasets/preprocessed/chinese_llama_small_test1.pt --spm_model_path /workspace/TencentPretrain/../models/llama_chinese/tokenizer.model --config_path models/llama/7b_config.json --output_model_path /workspace/TencentPretrain/../models/output/chinese_llama_7b_pretrain_small_test --world_size 1 --data_processor lm --deepspeed_checkpoint_activations --total_steps 10000 --save_checkpoint_steps 5000 --batch_size 24
[2023-05-24 10:00:05,026] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.9.9-1+cuda11.3
[2023-05-24 10:00:05,026] [INFO] [launch.py:135:main] 0 NCCL_VERSION=2.9.9-1
[2023-05-24 10:00:05,026] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.9.9-1
[2023-05-24 10:00:05,026] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.9.9-1+cuda11.3
[2023-05-24 10:00:05,026] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2023-05-24 10:00:05,026] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2023-05-24 10:00:05,026] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.9.9-1
[2023-05-24 10:00:05,026] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}
[2023-05-24 10:00:05,026] [INFO] [launch.py:149:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-05-24 10:00:05,026] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-05-24 10:00:05,026] [INFO] [launch.py:162:main] dist_world_size=1
[2023-05-24 10:00:05,026] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-05-24 10:00:06,577] [INFO] [comm.py:654:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-05-24 10:00:25,630] [INFO] [partition_parameters.py:416:__exit__] finished initializing model with 6.74B parameters
[2023-05-24 10:01:55,229] [WARNING] [cpu_adam.py:86:__init__] FP16 params for CPUAdam may not work on AMD CPUs
Using /root/.cache/torch_extensions/py37_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py37_cu113/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.713547706604004 seconds
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000020, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
[2023-05-24 10:02:10,518] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed info: version=0.8.3, git-hash=unknown, git-branch=unknown
[2023-05-24 10:02:10,532 INFO] Added key: store_based_barrier_key:2 to store for rank: 0
[2023-05-24 10:02:10,532 INFO] Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 1 nodes.
[2023-05-24 10:02:10,533] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: True
[2023-05-24 10:02:10,534] [INFO] [logging.py:93:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer
[2023-05-24 10:02:10,534] [INFO] [logging.py:93:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2023-05-24 10:02:10,546] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2023-05-24 10:02:10,546] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2023-05-24 10:02:10,546] [INFO] [logging.py:93:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 3 optimizer
[2023-05-24 10:02:10,614] [INFO] [utils.py:829:see_memory_usage] Stage 3 initialize beginning
[2023-05-24 10:02:10,614] [INFO] [utils.py:834:see_memory_usage] MA 0.0 GB         Max_MA 0.73 GB         CA 0.73 GB         Max_CA 1 GB 
[2023-05-24 10:02:10,615] [INFO] [utils.py:839:see_memory_usage] CPU Virtual Memory:  used = 23.44 GB, percent = 24.8%
[2023-05-24 10:02:10,617] [INFO] [stage3.py:113:__init__] Reduce bucket size 500000000
[2023-05-24 10:02:10,617] [INFO] [stage3.py:114:__init__] Prefetch bucket size 50000000
Using /root/.cache/torch_extensions/py37_cu113 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py37_cu113/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.2768416404724121 seconds
[2023-05-24 10:02:10,934] [INFO] [utils.py:829:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2023-05-24 10:02:10,935] [INFO] [utils.py:834:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.73 GB         Max_CA 1 GB 
[2023-05-24 10:02:10,935] [INFO] [utils.py:839:see_memory_usage] CPU Virtual Memory:  used = 23.44 GB, percent = 24.8%
Parameter Offload: Total persistent parameters: 266240 in 65 params
[2023-05-24 10:02:10,993] [INFO] [utils.py:829:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2023-05-24 10:02:10,994] [INFO] [utils.py:834:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.73 GB         Max_CA 1 GB 
[2023-05-24 10:02:10,994] [INFO] [utils.py:839:see_memory_usage] CPU Virtual Memory:  used = 23.44 GB, percent = 24.8%
[2023-05-24 10:02:11,033] [INFO] [utils.py:829:see_memory_usage] Before creating fp16 partitions
[2023-05-24 10:02:11,034] [INFO] [utils.py:834:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.73 GB         Max_CA 1 GB 
[2023-05-24 10:02:11,034] [INFO] [utils.py:839:see_memory_usage] CPU Virtual Memory:  used = 23.44 GB, percent = 24.8%
[2023-05-24 10:02:16,329] [INFO] [utils.py:829:see_memory_usage] After creating fp16 partitions: 7
[2023-05-24 10:02:16,330] [INFO] [utils.py:834:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.73 GB         Max_CA 1 GB 
[2023-05-24 10:02:16,330] [INFO] [utils.py:839:see_memory_usage] CPU Virtual Memory:  used = 39.73 GB, percent = 42.1%
[2023-05-24 10:02:16,369] [INFO] [utils.py:829:see_memory_usage] Before creating fp32 partitions
[2023-05-24 10:02:16,370] [INFO] [utils.py:834:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.73 GB         Max_CA 1 GB 
[2023-05-24 10:02:16,370] [INFO] [utils.py:839:see_memory_usage] CPU Virtual Memory:  used = 39.73 GB, percent = 42.1%
[2023-05-24 10:02:19,337] [INFO] [utils.py:829:see_memory_usage] After creating fp32 partitions
[2023-05-24 10:02:19,337] [INFO] [utils.py:834:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.73 GB         Max_CA 1 GB 
[2023-05-24 10:02:19,337] [INFO] [utils.py:839:see_memory_usage] CPU Virtual Memory:  used = 64.88 GB, percent = 68.7%
[2023-05-24 10:02:19,376] [INFO] [utils.py:829:see_memory_usage] Before initializing optimizer states
[2023-05-24 10:02:19,377] [INFO] [utils.py:834:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.73 GB         Max_CA 1 GB 
[2023-05-24 10:02:19,377] [INFO] [utils.py:839:see_memory_usage] CPU Virtual Memory:  used = 64.88 GB, percent = 68.7%
Traceback (most recent call last):
  File "pretrain.py", line 134, in <module>
    main()
  File "pretrain.py", line 130, in main
    trainer.train_and_validate(args)
  File "/workspace/TencentPretrain/tencentpretrain/trainer.py", line 79, in train_and_validate
    worker(args.local_rank, None, args, model_for_training, model_for_dataloader)
  File "/workspace/TencentPretrain/tencentpretrain/trainer.py", line 634, in worker
    dist_init_required=False)
  File "/opt/conda/lib/python3.7/site-packages/deepspeed/__init__.py", line 135, in initialize
    config_params=config_params)
  File "/opt/conda/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 340, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/opt/conda/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 1298, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/opt/conda/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 1626, in _configure_zero_optimizer
    communication_data_type=self.communication_data_type)
  File "/opt/conda/lib/python3.7/site-packages/deepspeed/runtime/zero/stage3.py", line 312, in __init__
    self._setup_for_real_optimizer()
  File "/opt/conda/lib/python3.7/site-packages/deepspeed/runtime/zero/stage3.py", line 371, in _setup_for_real_optimizer
    self.initialize_optimizer_states()
  File "/opt/conda/lib/python3.7/site-packages/deepspeed/runtime/zero/stage3.py", line 926, in initialize_optimizer_states
    device=self.device)
RuntimeError: [enforce fail at alloc_cpu.cpp:66] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 4047667200 bytes. Error code 12 (Cannot allocate memory)
[2023-05-24 10:02:31,188] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 228
[2023-05-24 10:02:31,189] [ERROR] [launch.py:324:sigkill_handler] ['/opt/conda/bin/python', '-u', 'pretrain.py', '--local_rank=0', '--deepspeed', '--deepspeed_config', 'models/deepspeed_zero3_config.json', '--enable_zero3', '--pretrained_model_path', '/workspace/TencentPretrain/../models/llama_chinese/chinese_llama_7b.bin', '--dataset_path', '/workspace/TencentPretrain/../datasets/preprocessed/chinese_llama_small_test1.pt', '--spm_model_path', '/workspace/TencentPretrain/../models/llama_chinese/tokenizer.model', '--config_path', 'models/llama/7b_config.json', '--output_model_path', '/workspace/TencentPretrain/../models/output/chinese_llama_7b_pretrain_small_test', '--world_size', '1', '--data_processor', 'lm', '--deepspeed_checkpoint_activations', '--total_steps', '10000', '--save_checkpoint_steps', '5000', '--batch_size', '24'] exits with return code = 1

The pre-trained model I start from has about 7B parameters and is about 13GB on disk. I am trying to continue training from this model.

Hardware-wise, I have one GPU (A100, 40G) and about 90G of CPU memory available, which I assume is not particularly small (or is it?).

free -h
              total        used        free      shared  buff/cache   available
Mem:            94G        2.5G         89G         25M        2.5G         90G
Swap:            0B          0B          0B

My local environment:

deepspeed 0.8.3, torch 1.12.1, Python 3.7, CUDA 11.3, cuDNN 8.3.2

I am new to DeepSpeed and training large models, so do let me know if my description is not clear.
I want to know if there is any way to get around this OOM issue, and how to estimate the memory required for DeepSpeed training given the pre-trained model size.
PS: Based on this FAQ https://github.com/CVI-SZU/Linly#faq (Q2), I should only need CPU memory of model_size * gpu_number, which would be around 14G in my case (14G * 1), but I actually have 90G and it is still not proceeding properly.
Thank you!

No GPU usage, only CPU running during inference

Training works well with GPUs. However, when I run inference, there is no GPU usage and only the CPU is running...
I use the following script to do inference:

python3 scripts/generate_lm.py --load_model_path models/llama-7b.bin --spm_model_path $LLaMA_7B_FOLDER/tokenizer.model \
                               --test_path beginning.txt --prediction_path generated_sentence.txt \
                               --config_path models/llama/7b_config.json 
