
transpeeder's People

Contributors

huanglk, jy-ren

transpeeder's Issues

Model split into 4 shards: how do I make each GPU load only one shard?

After splitting into a pipeline-parallel (pp) model, the current loading method in train_llama_deepspeed.sh has every GPU load a full copy, which is no different from not splitting at all. For a model split into 4 shards, how do I configure it so that GPUs 0,1,2,3 load one model and GPUs 4,5,6,7 load another, instead of loading 8 full copies?
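One way to express the intended mapping is through DeepSpeed's process topology (the same class appears in a later issue). The snippet below is a hedged sketch of that idea, not the repo's launch code; in this repo it presumably corresponds to launching on 8 GPUs with --pipe_parallel_size 4.

from deepspeed.runtime.pipe.topology import PipeModelDataParallelTopology

# Hedged sketch: 8 GPUs, a 4-stage pipeline, and therefore 2 data-parallel replicas.
# Ranks 0-3 hold the 4 shards of one replica and ranks 4-7 hold the other replica,
# instead of every rank loading a full copy of the model.
topo = PipeModelDataParallelTopology(num_pp=4, num_mp=1, num_dp=2)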

flash_attn_cuda.cpython-38-x86_64-linux-gnu.so: undefined symbol

I have already reinstalled flash_attn.

(gh_llama-deepspeed) r730ub20@r730ub20-M0:/llm_dev/llama-deepspeed$ python3 scripts/convert2ckpt.py --model_name_or_path /data-ssd-1t/hf_model/llama-7b-hf/ --output_dir llama-7b-init-ckpt/
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/r730ub20/llm_dev/llama-deepspeed/scripts/convert2ckpt.py:11 in <module>
│
│    8 import torch
│    9 import transformers
│   10
│ ❱ 11 from models.patching import (
│   12 │   smart_tokenizer_and_embedding_resize,
│   13 )
│   14 from feeder import (
│
│ /home/r730ub20/llm_dev/llama-deepspeed/./models/patching.py:11 in <module>
│
│    8 from transformers.models.llama.modeling_llama import apply_rotary_pos_emb
│    9
│   10 from einops import rearrange
│ ❱ 11 from flash_attn.flash_attn_interface import flash_attn_unpadded_qkvpacked_func
│   12 from flash_attn.bert_padding import unpad_input, pad_input
│   13
│   14
│
│ /home/r730ub20/.local/lib/python3.8/site-packages/flash_attn/flash_attn_interface.py:5 in <module>
│
│
│    2 import torch.nn as nn
│    3 import torch.nn.functional as F
│    4
│ ❱  5 import flash_attn_cuda
│    6
│    7
│    8 def _get_block_size(device, head_dim, is_dropout):
╰───────────────────────────────────────────────────────────────────────────────────────────────────╯
ImportError: /home/r730ub20/.local/lib/python3.8/site-packages/flash_attn_cuda.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN3c104impl8GPUTrace13gpuTraceStateE
(gh_llama-deepspeed) r730ub20@r730ub20-M0:/llm_dev/llama-deepspeed$

ImportError: cannot import name 'flash_attn_unpadded_qkvpacked_func' from 'flash_attn.flash_attn_interface'

Hello, when I run python convert2ckpt.py --mp_world_size 4 --model_name_or_path /path/to/llama-7b-hf --output_dir /path/to/llama-7b-init-ckpt I get the following error:

ImportError: cannot import name 'flash_attn_unpadded_qkvpacked_func' from 'flash_attn.flash_attn_interface'

I checked the flash_attn.flash_attn_interface module and it indeed has no flash_attn_unpadded_qkvpacked_func function. My environment is PyTorch 1.13, Python 3.10, flash-attn 2.0.8. Could you share your environment, or a solution?
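For reference, flash-attn 2.x renamed the unpadded_* interface functions to varlen_*. Assuming that is the cause here, a compatibility import like the hedged sketch below may work; the call signature should still be checked against the installed version.

# Hedged compatibility shim, not code from this repo: flash-attn 2.x exposes
# flash_attn_varlen_qkvpacked_func in place of flash_attn_unpadded_qkvpacked_func.
try:
    from flash_attn.flash_attn_interface import flash_attn_unpadded_qkvpacked_func
except ImportError:
    from flash_attn.flash_attn_interface import (
        flash_attn_varlen_qkvpacked_func as flash_attn_unpadded_qkvpacked_func,
    )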

Training startup is very slow when fully fine-tuning 7B LLaMA on four 3090s

The wandb initialization stage in particular hung for ten-odd minutes. Have you run into slow startup with multi-GPU training, and is there any possible way to improve it? @HuangLK
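If wandb initialization really is the bottleneck, one hedged workaround (an assumption, not something from this repo) is to disable wandb entirely before launching training:

import os

# wandb honors this environment variable; set it in the launcher environment
# (or run `wandb disabled` once) so wandb.init() returns immediately.
os.environ["WANDB_MODE"] = "disabled"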

Flash attention integration failed

Hello,

when I try to use flash attention, I have encountered the following problem:

│ /export/home2/fangkai/merit-v2/trainer_base_ds_mp.py:346 in main
│
│   343 │   │   │   logger.info("Resuming training from the latest checkpoint:
│   344 │   │   │   continue_from_global_step = int(checkpoint.split('-')[-1])
│   345 │   │
│ ❱ 346 │   │   global_step, tr_loss = train(cfg, model_pipe, tokenizer, conti
│   347 │   │   logger.info(" global_step = %s, average loss = %s", global_ste
│   348
│   349
│
│ /export/home2/fangkai/merit-v2/trainer_base_ds_mp.py:236 in train
│
│   233 │   │   │   │   │   continue
│   234 │   │   │   │
│   235 │   │   │   │   model.train()
│ ❱ 236 │   │   │   │   loss = model.train_batch(data_iter=sub_train_dataloade
│   237 │   │   │   │   global_step += 1
│   238 │   │   │   │
│   239 │   │   │   │   tr_loss += loss.item()
│
│ /export/home2/fangkai/anaconda3/envs/torch2.0/lib/python3.9/site-packages/de
│ epspeed/runtime/pipe/engine.py:336 in train_batch
│
│    333 │   │   sched = schedule.TrainSchedule(micro_batches=self.micro_batch
│    334 │   │   │   │   │   │   │   │   │      stages=self.num_stages,
│    335 │   │   │   │   │   │   │   │   │      stage_id=self.stage_id)
│ ❱  336 │   │   self._exec_schedule(sched)
│    337 │   │   self.agg_train_loss = self._aggregate_total_loss()
│    338 │   │
│    339 │   │   self.timers('train_batch').stop()
│
│ /export/home2/fangkai/anaconda3/envs/torch2.0/lib/python3.9/site-packages/de
│ epspeed/runtime/pipe/engine.py:1307 in _exec_schedule
│
│   1304 │   │   │   │
│   1305 │   │   │   │   # Equivalent to: self._exec_forward_pass(buffer_id=0)
│   1306 │   │   │   │   self._exec_instr = MethodType(self._INSTRUCTION_MAP[t
│ ❱ 1307 │   │   │   │   self._exec_instr(**cmd.kwargs)
│   1308
│
│ /export/home2/fangkai/anaconda3/envs/torch2.0/lib/python3.9/site-packages/de
│ epspeed/runtime/pipe/engine.py:996 in _exec_send_grads
│
│    993 │   │   │   │   │   if not buffer.is_floating_point():
│    994 │   │   │   │   │   │   assert buffer.grad is None
│    995 │   │   │   │   │   │   continue
│ ❱  996 │   │   │   │   │   assert buffer.grad is not None
│    997 │   │   │   │   │   p2p.send(buffer.grad, self.prev_stage)
│    998 │   │
│    999 │   │   # We can free up the input buffer now
╰──────────────────────────────────────────────────────────────────────────────╯
AssertionError

I also tested it using torch.nn.functional.scaled_dot_product_attention, which implements flash attention in torch 2.0, but I hit the same problem. May I know if you have encountered this problem?

Thanks for your help very much!

Best,
Fangkai
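For context, the assertion above fires in DeepSpeed's _exec_send_grads, which (as the quoted engine code shows) only sends gradients for floating-point buffers and expects every floating-point stage output to carry a grad. The snippet below is a hedged illustration of that constraint, not a fix taken from this repo:

import torch

# Hedged sketch: tensors passed between pipeline stages that should not carry
# gradients (e.g. an attention mask) are safer as bool/int tensors, which the
# is_floating_point() check skips; activations stay float and require grad.
hidden_states = torch.randn(2, 16, 64, requires_grad=True)   # gradients flow back
attention_mask = torch.ones(2, 16, dtype=torch.bool)         # skipped by the grad send
stage_outputs = (hidden_states, attention_mask)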

Question about batch size

Hello, in config.json, does train_micro_batch_size_per_gpu mean the chunk (micro-batch) size under the pipeline mechanism? train_batch_size is the total batch size.
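For reference, a hedged sketch of how the standard DeepSpeed config keys relate; the key names are standard DeepSpeed, and the values are illustrative rather than taken from this repo's config.json:

# train_batch_size = train_micro_batch_size_per_gpu
#                    * gradient_accumulation_steps
#                    * data-parallel world size
# Under pipeline parallelism the micro batch is the per-chunk size, and
# gradient_accumulation_steps is the number of chunks pushed through the pipe per step.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,   # one pipeline chunk
    "gradient_accumulation_steps": 8,      # chunks per optimizer step
    "train_batch_size": 4 * 8 * 1,         # assuming data-parallel size 1
}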

Model loading

Hello, I see that you added self.activation_checkpointing = activation_checkpointing in ParallelTransformerLayerPipe, but this parameter does not exist in the LLaMA model. Won't loading the LLaMA model fail?
I also see that in the updated code, the HF format is first converted to the DeepSpeed format and then loaded with engine.load_checkpoint(model_args.init_ckpt, load_module_only=True). During that loading step, is this parameter simply not loaded by default?

how can I run it with a 24 GB GPU card like the 3090

I got GPU OOM

(gh_llama-deepspeed) amd00@asus00:/llm_dev/llama-deepspeed$ deepspeed --include localhost:0 --master_port 22384 train.py --output_dir out_dir --init_ckpt llama-7b-init-ckpt/ --data_path ./data/alpaca_data_sample_oneline_format.json --max_seq_len 8 --train_steps 1000 --eval_steps 10 --save_steps 200 --log_steps 1 --pipe_parallel_size 1 --model_parallel_size 1 --use_flash_attn false --deepspeed_config ./configs/ds_config_zero1.json
[2023-05-31 17:15:04,883] [WARNING] [runner.py:191:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-05-31 17:15:04,892] [INFO] [runner.py:541:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=22384 --enable_each_rank_log=None train.py --output_dir out_dir --init_ckpt llama-7b-init-ckpt/ --data_path ./data/alpaca_data_sample_oneline_format.json --max_seq_len 8 --train_steps 1000 --eval_steps 10 --save_steps 200 --log_steps 1 --pipe_parallel_size 1 --model_parallel_size 1 --use_flash_attn false --deepspeed_config ./configs/ds_config_zero1.json
[2023-05-31 17:15:06,134] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0]}
[2023-05-31 17:15:06,134] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-05-31 17:15:06,134] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-05-31 17:15:06,134] [INFO] [launch.py:247:main] dist_world_size=1
[2023-05-31 17:15:06,134] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-05-31 17:15:07,635] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 1/1 [00:00<00:00, 3358.13it/s]
total samples num: 50
Traceback (most recent call last):
File "train.py", line 130, in
main()
File "train.py", line 99, in main
model = get_model(model_config, ds_args, activation_checkpointing_config)
File "/home/amd00/llm_dev/llama-deepspeed/models/llama_pipeline_model.py", line 167, in get_model
print("pp is %d, mp is %d, world_size is:", pp, mp, args.world_size)
UnboundLocalError: local variable 'pp' referenced before assignment
[2023-05-31 17:15:08,142] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 26374
[2023-05-31 17:15:08,143] [ERROR] [launch.py:434:sigkill_handler] ['/usr/bin/python3', '-u', 'train.py', '--local_rank=0', '--output_dir', 'out_dir', '--init_ckpt', 'llama-7b-init-ckpt/', '--data_path', './data/alpaca_data_sample_oneline_format.json', '--max_seq_len', '8', '--train_steps', '1000', '--eval_steps', '10', '--save_steps', '200', '--log_steps', '1', '--pipe_parallel_size', '1', '--model_parallel_size', '1', '--use_flash_attn', 'false', '--deepspeed_config', './configs/ds_config_zero1.json'] exits with return code = 1
(gh_llama-deepspeed) amd00@asus00:/llm_dev/llama-deepspeed$ vim train.py
(gh_llama-deepspeed) amd00@asus00:/llm_dev/llama-deepspeed$ vim models/llama_pipeline_model.py
(gh_llama-deepspeed) amd00@asus00:/llm_dev/llama-deepspeed$ deepspeed --include localhost:0 --master_port 22384 train.py --output_dir out_dir --init_ckpt llama-7b-init-ckpt/ --data_path ./data/alpaca_data_sample_oneline_format.json --max_seq_len 8 --train_steps 1000 --eval_steps 10 --save_steps 200 --log_steps 1 --pipe_parallel_size 1 --model_parallel_size 1 --use_flash_attn false --deepspeed_config ./configs/ds_config_zero1.json
[2023-05-31 17:16:32,333] [WARNING] [runner.py:191:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-05-31 17:16:32,342] [INFO] [runner.py:541:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=22384 --enable_each_rank_log=None train.py --output_dir out_dir --init_ckpt llama-7b-init-ckpt/ --data_path ./data/alpaca_data_sample_oneline_format.json --max_seq_len 8 --train_steps 1000 --eval_steps 10 --save_steps 200 --log_steps 1 --pipe_parallel_size 1 --model_parallel_size 1 --use_flash_attn false --deepspeed_config ./configs/ds_config_zero1.json
[2023-05-31 17:16:33,582] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0]}
[2023-05-31 17:16:33,582] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-05-31 17:16:33,582] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-05-31 17:16:33,582] [INFO] [launch.py:247:main] dist_world_size=1
[2023-05-31 17:16:33,582] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-05-31 17:16:35,093] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 1/1 [00:00<00:00, 3368.92it/s]
total samples num: 50
pp is %d, mp is %d, world_size is: 1 1 1
SEED_LAYERS=False BASE_SEED=42 SEED_FN=None
Using topology: {ProcessCoord(pipe=0, data=0, model=0): 0}
[2023-05-31 17:16:35,204] [INFO] [module.py:358:_partition_layers] Partitioning pipeline stages with method parameters
stage=0 layers=35
0: EmbeddingPipe
1: ParallelTransformerLayerPipe
2: ParallelTransformerLayerPipe
3: ParallelTransformerLayerPipe
4: ParallelTransformerLayerPipe
5: ParallelTransformerLayerPipe
6: ParallelTransformerLayerPipe
7: ParallelTransformerLayerPipe
8: ParallelTransformerLayerPipe
9: ParallelTransformerLayerPipe
10: ParallelTransformerLayerPipe
11: ParallelTransformerLayerPipe
12: ParallelTransformerLayerPipe
13: ParallelTransformerLayerPipe
14: ParallelTransformerLayerPipe
15: ParallelTransformerLayerPipe
16: ParallelTransformerLayerPipe
17: ParallelTransformerLayerPipe
18: ParallelTransformerLayerPipe
19: ParallelTransformerLayerPipe
20: ParallelTransformerLayerPipe
21: ParallelTransformerLayerPipe
22: ParallelTransformerLayerPipe
23: ParallelTransformerLayerPipe
24: ParallelTransformerLayerPipe
25: ParallelTransformerLayerPipe
26: ParallelTransformerLayerPipe
27: ParallelTransformerLayerPipe
28: ParallelTransformerLayerPipe
29: ParallelTransformerLayerPipe
30: ParallelTransformerLayerPipe
31: ParallelTransformerLayerPipe
32: ParallelTransformerLayerPipe
33: LayerNormPipe
34: LMLayerPipe
loss: loss_fn
Traceback (most recent call last):
File "train.py", line 130, in
main()
File "train.py", line 99, in main
model = get_model(model_config, ds_args, activation_checkpointing_config)
File "/home/amd00/llm_dev/llama-deepspeed/models/llama_pipeline_model.py", line 182, in get_model
return GPT2ModelPipe(model_config,
File "/home/amd00/llm_dev/llama-deepspeed/models/llama_pipeline_model.py", line 157, in init
super().init(
File "/home/amd00/.local/lib/python3.8/site-packages/deepspeed/runtime/pipe/module.py", line 200, in init
self.to(get_accelerator().device_name(self.local_rank))
File "/home/amd00/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 987, in to
return self._apply(convert)
File "/home/amd00/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 639, in _apply
module._apply(fn)
File "/home/amd00/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 639, in _apply
module._apply(fn)
File "/home/amd00/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 639, in _apply
module._apply(fn)
File "/home/amd00/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 662, in _apply
param_applied = fn(param)
File "/home/amd00/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 985, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB (GPU 0; 23.70 GiB total capacity; 22.83 GiB already allocated; 97.88 MiB free; 22.83 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2023-05-31 17:17:30,649] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 26532
[2023-05-31 17:17:30,650] [ERROR] [launch.py:434:sigkill_handler] ['/usr/bin/python3', '-u', 'train.py', '--local_rank=0', '--output_dir', 'out_dir', '--init_ckpt', 'llama-7b-init-ckpt/', '--data_path', './data/alpaca_data_sample_oneline_format.json', '--max_seq_len', '8', '--train_steps', '1000', '--eval_steps', '10', '--save_steps', '200', '--log_steps', '1', '--pipe_parallel_size', '1', '--model_parallel_size', '1', '--use_flash_attn', 'false', '--deepspeed_config', './configs/ds_config_zero1.json'] exits with return code = 1
(gh_llama-deepspeed) amd00@asus00:/llm_dev/llama-deepspeed$
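For context on the OOM itself: a 7B model in fp16 is roughly 13 GB of weights before any optimizer state, and full fine-tuning with Adam adds several times that again, so it cannot fit on a single 24 GB card as-is. Below is a hedged sketch of one direction to try; the keys are standard DeepSpeed ZeRO config options, and the values are assumptions rather than this repo's configs.

ds_config = {
    "train_batch_size": 1,
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                                          # shard optimizer state and grads
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}
# passed via --deepspeed_config, or directly as deepspeed.initialize(config=ds_config, ...)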

Error when retraining 7B LLaMA on four GPUs after clearing the cache

I used the default ds_config.json, only changing the wandb section to false (because it is slow). GPU memory got allocated but training never started (it hung at "Using /root/.cache/torch_extensions as PyTorch extensions root...").
So I cleared /root/.cache and trained again, and now it errors out. The error message is as follows:

Using /root/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py38_cu116/fused_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu116/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] =/usr/local/cuda-11.6/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/adam -isystem /usr/local/lib/python3.8/dist-packages/torch/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.8/dist-packages/torch/include/THC -isystem =/usr/local/cuda-11.6/include -isystem /usr/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -std=c++14 -c /usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
FAILED: multi_tensor_adam.cuda.o
=/usr/local/cuda-11.6/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/adam -isystem /usr/local/lib/python3.8/dist-packages/torch/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.8/dist-packages/torch/include/THC -isystem =/usr/local/cuda-11.6/include -isystem /usr/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -std=c++14 -c /usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
/bin/sh: 1: =/usr/local/cuda-11.6/bin/nvcc: not found
Using /root/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
[2023-04-21 17:47:56,170] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.9.1, git-hash=unknown, git-branch=unknown
[2023-04-21 17:47:56,315] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /root/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
[2/3] c++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/adam -isystem /usr/local/lib/python3.8/dist-packages/torch/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.8/dist-packages/torch/include/THC -isystem =/usr/local/cuda-11.6/include -isystem /usr/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -c /usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1808, in _run_ninja_build
subprocess.run(
File "/usr/lib/python3.8/subprocess.py", line 516, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "train.py", line 143, in
main()
File "train.py", line 109, in main
engine, _, _, _ = deepspeed.initialize(
File "/usr/local/lib/python3.8/dist-packages/deepspeed/init.py", line 180, in initialize
engine = PipelineEngine(args=args,
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 53, in init
super().init(*super_args, **super_kwargs)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 308, in init
self._configure_optimizer(optimizer, model_parameters)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1156, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1222, in _configure_basic_optimizer
optimizer = FusedAdam(
File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/adam/fused_adam.py", line 71, in init
fused_adam_cuda = FusedAdamBuilder().load()
File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/op_builder/builder.py", line 445, in load
return self.jit_load(verbose)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/op_builder/builder.py", line 480, in jit_load
op_module = load(name=self.name,
File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1202, in load
return _jit_compile(
File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1425, in _jit_compile
_write_ninja_file_and_build_library(
File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1537, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1824, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused_adam'
Loading extension module fused_adam...
Traceback (most recent call last):
File "train.py", line 143, in
main()
File "train.py", line 109, in main
engine, _, _, _ = deepspeed.initialize(
File "/usr/local/lib/python3.8/dist-packages/deepspeed/init.py", line 180, in initialize
engine = PipelineEngine(args=args,
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 53, in init
super().init(*super_args, **super_kwargs)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 308, in init
self._configure_optimizer(optimizer, model_parameters)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1156, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1222, in _configure_basic_optimizer
optimizer = FusedAdam(
File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/adam/fused_adam.py", line 71, in init
fused_adam_cuda = FusedAdamBuilder().load()
File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/op_builder/builder.py", line 445, in load
return self.jit_load(verbose)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/op_builder/builder.py", line 480, in jit_load
op_module = load(name=self.name,
File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1202, in load
return _jit_compile(
File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1450, in _jit_compile
return _import_module_from_library(name, build_directory, is_python_module)
File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1844, in _import_module_from_library
module = importlib.util.module_from_spec(spec)
File "", line 556, in module_from_spec
File "", line 1166, in create_module
File "", line 219, in _call_with_frames_removed
ImportError: /root/.cache/torch_extensions/py38_cu116/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory
[2023-04-21 17:48:12,493] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 105683
[2023-04-21 17:48:12,710] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 105684
[2023-04-21 17:48:12,710] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 105685
[2023-04-21 17:48:12,845] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 105686
[2023-04-21 17:48:12,847] [ERROR] [launch.py:434:sigkill_handler] ['/usr/bin/python3', '-u', 'train.py', '--local_rank=3', '--output_dir', '/root/nas-private/output', '--init_ckpt', '/root/nas-private/llama-7B-init-ckpt', '--data_path', './data/alpaca_data_sample_oneline_format.json', '--max_seq_len', '1024', '--train_steps', '1000', '--eval_steps', '10', '--save_steps', '200', '--log_steps', '1', '--pipe_parallel_size', '4', '--model_parallel_size', '1', '--use_flash_attn', 'true', '--deepspeed_config', './configs/ds_config.json'] exits with return code = 1

error when using zero1

Traceback (most recent call last):
  File "train.py", line 131, in <module>
    main()
  File "train.py", line 109, in main
    engine.load_checkpoint(model_args.init_ckpt,load_module_only=True)#load_module_only=True
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2769, in load_checkpoint
    success = self._load_zero_checkpoint(
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2948, in _load_zero_checkpoint
    zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 3042, in _get_all_zero_checkpoints
    return self._get_all_zero_checkpoint_state_dicts(zero_ckpt_names)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 3014, in _get_all_zero_checkpoint_state_dicts
    _state = self.checkpoint_engine.load(
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/checkpoint_engine/torch_checkpoint_engine.py", line 22, in load
    partition = torch.load(path, map_location=map_location)
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 699, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 231, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 212, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: './llama-7B-init-ckpt/global_step001/zero_pp_rank_0_mp_rank_01_optim_states.pt'
[2023-08-13 20:35:08,552] [INFO] [torch_checkpoint_engine.py:21:load] [Torch] Loading checkpoint from ./llama-7B-init-ckpt/global_step001/zero_pp_rank_0_mp_rank_02_optim_states.pt...
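A hedged note: the checkpoint written by convert2ckpt.py apparently contains only module weights, so the ZeRO path fails when it looks for zero_pp_rank_*_optim_states.pt files. The keyword arguments below are standard DeepSpeed load_checkpoint parameters; whether skipping optimizer state resolves this particular setup is an assumption, and the helper name is made up for illustration.

def load_init_ckpt(engine, init_ckpt_dir):
    # Sketch only, not the repo's train.py: skip optimizer / LR-scheduler state when
    # loading an init checkpoint that contains module weights only.
    load_path, client_state = engine.load_checkpoint(
        init_ckpt_dir,
        load_module_only=True,
        load_optimizer_states=False,
        load_lr_scheduler_states=False,
    )
    return load_path, client_state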

Some questions about the pipeline model

Hi, this is a great project. I have a few questions from studying it and hope you can spare some time to answer them:

  1. About the order of building the pipeline model vs. loading the pretrained model: is it because the ckpt only stores the weights that the model must be defined first and the weights loaded afterwards?
  2. Does engine.load_checkpoint here have to load a DeepSpeed ckpt? Why can't the HF-format checkpoint be used directly?
    # pipeline model
    model = get_model(model_config, ds_args, activation_checkpointing_config)

    engine, _, _, _ = deepspeed.initialize(
        ds_args,
        model=model,
        model_parameters=[p for p in model.parameters() if p.requires_grad]
    )

    # use `convert2ckpt.py`
    engine.load_checkpoint(model_args.init_ckpt, load_module_only=True)

Loading the checkpoint in train.py seems to have no effect

Line 108 of train.py:

engine.load_checkpoint(model_args.init_ckpt, load_module_only=True)

With or without this line, the initial training loss is the same. It seems the model parameters are not actually being loaded.
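One hedged way to check whether the load actually changes anything (a debugging sketch, not code from the repo; the helper name is made up):

import torch

def check_checkpoint_applied(engine, init_ckpt_dir):
    # Snapshot one parameter before the load and compare it afterwards.
    before = next(engine.module.parameters()).detach().cpu().clone()
    engine.load_checkpoint(init_ckpt_dir, load_module_only=True)
    after = next(engine.module.parameters()).detach().cpu()
    return not torch.equal(before, after)   # True if the load changed the weights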

Running 7B succeeded. Next: 30B


Thank you for your implementation of pipeline parallelism for LLaMA model training.
I had encountered a hang when running 7B training on a 4xA40 machine.
Can you provide a Dockerfile that is known to run on such a machine?

TypeError: 'NoneType' object is not subscriptable

deling_llama.py:134 in apply_rotary_pos_emb
│
│   131
│   132
│   133 def apply_rotary_pos_emb(q, k, cos, sin, position_ids):
│ ❱ 134 │   gather_indices = position_ids[:, None, :, None]  # [bs, 1, seq_len
│   135 │   gather_indices = gather_indices.repeat(1, cos.shape[1], 1, cos.sha
│   136 │   cos = torch.gather(cos.repeat(gather_indices.shape[0], 1, 1, 1), 2
│   137 │   sin = torch.gather(sin.repeat(gather_indices.shape[0], 1, 1, 1), 2
╰──────────────────────────────────────────────────────────────────────────────╯
TypeError: 'NoneType' object is not subscriptable
[2023-04-13 11:32:44,508] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 197
[2023-04-13 11:32:44,508] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 198
[2023-04-13 11:32:47,255] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 199
[2023-04-13 11:32:49,894] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 200
@HuangLK do you know how this happened and how to solve it?
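The error indicates that apply_rotary_pos_emb is being called with position_ids=None. Below is a hedged sketch of one common workaround (an assumption, not the author's fix): rebuild position_ids in the attention patch when the pipeline does not pass them along.

import torch

def ensure_position_ids(position_ids, q):
    # q is expected to be [batch, heads, seq_len, head_dim]; if the pipeline stage
    # did not forward position_ids, rebuild the default 0..seq_len-1 positions.
    if position_ids is None:
        seq_len = q.shape[-2]
        position_ids = torch.arange(seq_len, device=q.device).unsqueeze(0)
    return position_ids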

size mismatch

Hello, after converting with convert2ckpt.py, the model's embedding size is increased by 1, right? Do I need to change vocab_size in the corresponding config.json accordingly?
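For context, a hedged sketch of what smart_tokenizer_and_embedding_resize presumably does during conversion; the paths, the pad token, and the 32000 -> 32001 numbers are assumptions for a standard LLaMA tokenizer, not values taken from this repo. Adding a pad token grows the vocabulary by one, so the converted checkpoint's config must use the enlarged vocab_size.

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("/path/to/llama-7b-hf", use_fast=False)
model = AutoModelForCausalLM.from_pretrained("/path/to/llama-7b-hf")

num_added = tokenizer.add_special_tokens({"pad_token": "[PAD]"})  # typically 1
model.resize_token_embeddings(len(tokenizer))
# e.g. 32000 -> 32001; config.json's vocab_size must match this value.
print(model.get_input_embeddings().weight.shape[0])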

Loss quickly drops to 0


During training, the loss quickly drops to 0 with the same configuration. Many thanks :D

RuntimeError: element 1 of tensors does not require grad and does not have a grad_fn

Hi, wonderful work!

I didn't use your code directly, but I followed it to implement my own LLaMA pipeline parallelism, and I'm encountering the following problem. May I know if you have run into similar problems? I have no idea how to solve it.

Thanks for your help very much!

The error message:

Traceback (most recent call last)
  File "/home/fangkai/merit-v2/trainer_base_ds_mp.py", line 418, in <module>
    main() 
  File "/home/fangkai/anaconda3/envs/py3.9/lib/python3.9/site-packages/hydra/main.py", line 90, in decorated_main
  File "/home/fangkai/anaconda3/envs/py3.9/lib/python3.9/site-packages/hydra/_internal/utils.py", line 389, in _run_hydra
    _run_app( 
  File "/home/fangkai/anaconda3/envs/py3.9/lib/python3.9/site-packages/hydra/_internal/utils.py", line 452, in _run_app
    run_and_report(
  File "/home/fangkai/anaconda3/envs/py3.9/lib/python3.9/site-packages/hydra/_internal/utils.py", line 216, in run_and_report
    raise ex
  File "/home/fangkai/anaconda3/envs/py3.9/lib/python3.9/site-packages/hydra/_internal/utils.py", line 213, in run_and_report
    return func()
  File "/home/fangkai/anaconda3/envs/py3.9/lib/python3.9/site-packages/hydra/_internal/utils.py", line 453, in <lambda>                                                 
    lambda: hydra.run(
  File "/home/fangkai/anaconda3/envs/py3.9/lib/python3.9/site-packages/hydra/_internal/hydra.py", line 132, in run                                                                          
    _ = ret.return_value 
  File "/home/fangkai/anaconda3/envs/py3.9/lib/python3.9/site-packages/hydra/core/utils.py", line 260, in return_value                                                                      
    raise self._return_value
  File "/home/fangkai/anaconda3/envs/py3.9/lib/python3.9/site-packages/hydra/core/utils.py", line 186, in run_job                                                                           
    ret.return_value = task_function(task_cfg)
  File "/home/fangkai/merit-v2/trainer_base_ds_mp.py", line 352, in main                                                                                                                    
    global_step, tr_loss = train(cfg, model, tokenizer, continue_from_global_step)                                                                                                          
  File "/home/fangkai/merit-v2/trainer_base_ds_mp.py", line 212, in train                                                                                                                   
    loss = model.train_batch(sub_train_dataloader)                                                                                                                                          
  File "/home/fangkai/anaconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 336, in train_batch 
    self._exec_schedule(sched) 
  File "/home/fangkai/anaconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 1307, in _exec_schedule 
    self._exec_instr(**cmd.kwargs)
  File "/home/fangkai/anaconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 733, in _exec_backward_pass
    torch.autograd.backward(tensors=out_tensors, grad_tensors=grad_tensors)                                                                                                                 
  File "/home/fangkai/anaconda3/envs/py3.9/lib/python3.9/site-packages/torch/autograd/__init__.py", line 200, in backward                                                                   
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass                                                                                          
RuntimeError: element 1 of tensors does not require grad and does not have a grad_fn

Here is a toy dataset:

from torch.utils.data import Dataset

class TestDataset(Dataset):
    def __init__(self, file_path, tokenizer):
        super().__init__()
        self.data = ["My name is Jiao Fangkai."]

    def __len__(self):
        return 100000000

    def __getitem__(self, index):
        return {"flan": {
            "inputs": self.data[0],
            "targets": self.data[0],
        }}

Here is the collator:

import torch
from transformers import AutoTokenizer, PreTrainedTokenizer

def vanilla_seq2seq_convertor(examples, tokenizer: PreTrainedTokenizer, max_seq_length, decoder_only: bool = False):
    inputs = []
    outputs = []
    for exp in examples:
        inputs.append(exp["inputs"])
        if decoder_only:
            outputs.append(exp["inputs"] + " " + exp["targets"] + tokenizer.eos_token)
        else:
            outputs.append(exp["targets"])

    model_inputs = tokenizer(inputs, text_target=outputs, max_length=max_seq_length, padding="longest",
                             truncation=True, return_tensors="pt")
    if decoder_only:
        input_lens = model_inputs["input_ids"].ne(tokenizer.pad_token_id).sum(dim=1)
        model_inputs = tokenizer(outputs, max_length=max_seq_length, padding="longest",
                                 truncation=True, return_tensors="pt")
        new_input_lens = model_inputs["input_ids"].ne(tokenizer.pad_token_id).sum(dim=1)
        input_lens = input_lens - input_lens.eq(new_input_lens).to(input_lens.dtype) * (input_lens // 2)
        input_lens = input_lens.to(torch.long)
        model_inputs["input_lens"] = input_lens

    return model_inputs

def get_lm_labels(input_lens, input_ids, pad_token_id):
    labels = input_ids.clone()

    label_mask = labels.ne(pad_token_id)
    lens_mask = torch.arange(labels.size(1))[None, :] >= input_lens[:, None]
    label_mask = label_mask & lens_mask

    labels = labels.masked_fill(~label_mask, -100).contiguous()

    return labels

class FlanCollatorOverCollator:
    def __init__(self, tokenizer: str, max_seq_length: int, decoder_only: bool = False):
        self.tokenizer: PreTrainedTokenizer = AutoTokenizer.from_pretrained(tokenizer, use_fast=False)
        expand_special_tokenizer(self.tokenizer)
        self.max_seq_length = max_seq_length
        self.decoder_only = decoder_only

    def __call__(self, batch):
        flan_batch = []
        for item in batch:
            flan_batch.append(item.pop("flan"))

        model_inputs = vanilla_seq2seq_convertor(flan_batch, self.tokenizer, self.max_seq_length, self.decoder_only)

        
        # Add suffix `input_ids` to tackle the deepspeed logic.
        seq_length = model_inputs["input_ids"].size(1)
        position_ids = torch.arange(0, seq_length, dtype=torch.long)
        position_ids = position_ids.unsqueeze(0).view(-1, seq_length)
        return (
                (
                    model_inputs["input_ids"],
                    model_inputs["attention_mask"],
                    # position_ids,
                    # model_inputs["input_lens"],
                    # model_inputs["input_ids"].detach().clone()
                ),
                # model_inputs["input_ids"].detach().clone()
                get_lm_labels(model_inputs["input_lens"], model_inputs["input_ids"], self.tokenizer.pad_token_id)
        )

        return model_inputs

And the initialization:

from deepspeed.pipe import PipelineModule
from deepspeed.runtime.pipe.topology import PipeModelDataParallelTopology

topo = PipeModelDataParallelTopology(num_pp=4, num_mp=1, num_dp=1)
model = PipelineModule(layers=layers,
                           # num_stages=cfg.num_stages,
                           topology=topo,
                           loss_fn=models.llama_ds_mp_wrap.loss_fn,
                           activation_checkpoint_interval=getattr(cfg, "activation_checkpoint_interval", 0))

hidden_states becomes a bool variable

Hi! After I hit an error while running (see screenshot),
I simply changed it to attention_mask=None, and then the following error appeared (see screenshot).
Printing the variable shows it is a bool, which causes the failure. Do you know what the cause might be?

attention mask

In feeder.py the model is only given a causal mask, not a padding mask; this seems like something that needs improving.
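A hedged sketch of combining the two masks (not feeder.py's actual code; the helper name is made up):

import torch

def build_attention_mask(input_ids: torch.Tensor, pad_token_id: int) -> torch.Tensor:
    # Combine a causal mask with a padding mask so padded key positions are
    # never attended to, in addition to future positions being hidden.
    bsz, seq_len = input_ids.shape
    causal = torch.tril(torch.ones(seq_len, seq_len)).bool()        # [seq, seq]
    padding = input_ids.ne(pad_token_id)                            # [bsz, seq]
    return causal[None, None, :, :] & padding[:, None, None, :]     # [bsz, 1, seq, seq]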

File not found error

Hi Huang, nice work!

when I tried to train with a 13B model, I got the error:
[Errno 2] No such file or directory: 'llama_13b_pp/global_step001/zero_pp_rank_0_mp_rank_03_optim_states.pt'

Any ideas on this? The 'convert2ckpt.py' script does not generate files with prefix 'zero_pp_....'

Output is not getting saved

I tried fine-tuning LLaMA 30B on A100s (2 GPUs with 80 GB each). The script finished running in 5 minutes and no output was generated. I couldn't find any error either.


The command used to run the script:

deepspeed --include A1:0,1 --master_port 22384 train.py --output_dir output --init_ckpt /root/llama-30b-init-ckpt/ --data_path /root/alpaca_deepspeed.json --max_seq_len 1024 --train_steps 1000 --eval_steps 10 --save_steps 200 --log_steps 1 --pipe_parallel_size 2 --model_parallel_size 1 --use_flash_attn true --deepspeed_config ./configs/ds_config_zero1.json
