
[BUG] Zero3: Gather the params for inference (huggingface_language_model.generate) at the end of one epoch and re-partition them for the next training epoch · deepspeed · 9 comments · OPEN

Coobiw commented on June 2, 2024

Comments (9)

tjruwase commented on June 2, 2024

@Coobiw, you can use the GatheredParameters context manager, which will automatically gather the parameters within the context and release them on exit. You can see a simple example of using it to compute a moving average of parameters here.
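
For reference, a minimal sketch of that pattern around generation, assuming a ZeRO-3 initialized model and pre-tokenized input_ids (both placeholders, not taken from this issue):

import torch
import deepspeed

# Read-only gather: every rank materializes the full (ZeRO-3 partitioned)
# parameters inside the context and releases them again on exit. Entering
# the context issues collectives, so all ranks must reach it together.
with deepspeed.zero.GatheredParameters(list(model.parameters())):
    model.eval()
    with torch.no_grad():
        output_ids = model.generate(input_ids, max_new_tokens=64)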

Coobiw commented on June 2, 2024

Hi, I've tried this before, but the program gets stuck. How can I debug this?

Also, I'd like to know whether this is simply because I'm using a 30B+ LLM and ZeRO-3 inference is very slow.

if self.zero_stage == 3:
    params_to_fetch = [
        p for p in self.model.parameters()
        if hasattr(p, 'ds_id') and p.ds_status == deepspeed.zero.partition_parameters.ZeroParamStatus.NOT_AVAILABLE
    ]
    should_gather_param = len(params_to_fetch) > 0
    with deepspeed.zero.GatheredParameters(params_to_fetch, enabled=should_gather_param):
        self.model.eval()
        evaluation()  # contains model.generate()

tjruwase commented on June 2, 2024

@Coobiw, can you share your full script to help us repro on our side?

Is this a dense or MoE model?

In terms of debugging, can you use prints to pinpoint where the hang occurs?
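
For example, a hedged sketch of such rank-tagged prints, reusing params_to_fetch, should_gather_param, and evaluation from the snippet above (purely illustrative):

import deepspeed
import torch.distributed as dist

rank = dist.get_rank()

def trace(tag):
    # flush=True so the message is not lost in a buffer if the process hangs
    print(f"[rank {rank}] reached: {tag}", flush=True)

trace("before GatheredParameters")
with deepspeed.zero.GatheredParameters(params_to_fetch, enabled=should_gather_param):
    trace("params gathered")
    evaluation()  # the user's loop containing model.generate()
    trace("evaluation done")
trace("params released")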

Also, can you try to repro on a single GPU so that you can use pdb for debugging? You can try two options for this:

  1. Enable CPU/NVMe offloading to fit the model (a minimal config sketch follows this list), or
  2. Use a smaller model
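
For option 1, a minimal sketch of a ZeRO-3 config with parameter and optimizer offload to CPU, written as a Python dict and passed to deepspeed.initialize (all values are illustrative and not taken from this issue; model is a placeholder):

import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},      # fit the params
        "offload_optimizer": {"device": "cpu", "pin_memory": True},  # and optimizer state
    },
    "bf16": {"enabled": True},
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)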

Coobiw commented on June 2, 2024

Sorry, it is inconvenient to share the whole code, but I'll try my best to provide more information. It is a dense model. I've also tried the script with a ~9B model on an A100 80GB, and a similar hang appeared.

I think it may be a multi-GPU communication problem? There is no explicit error, only a warning in model.generate that is related to NCCL:

/root/miniconda3/lib/python3.9/site-packages/transformers/generation/configuration_utils.py:497: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
  warnings.warn(
t-20240517175036-k966t-worker-0:5136:5282 [7] ib_plugin.c:798 NCCL WARN NET/IB : req 0/1 tag 7 peer 172.25.40.117<36987> collective mismatch error, local size 897024 remote size 614400
t-20240517175036-k966t-worker-0:5136:5282 [7] NCCL INFO transport/net.cc:990 -> 5
t-20240517175036-k966t-worker-0:5136:5282 [7] NCCL INFO proxy.cc:679 -> 5
t-20240517175036-k966t-worker-0:5136:5282 [7] NCCL INFO proxy.cc:858 -> 5 [Proxy Thread]

I guess that "collective mismatch error, local size 897024 remote size 614400" is what causes the hang.

Additionally, my environment is as follows:

deepspeed == 0.14.0
cuda: 11.8

The output of nvcc -V is:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

Coobiw commented on June 2, 2024

After double-checking, I found another error message on one worker, as follows (probably a time-out error):

[E ProcessGroupNCCL.cpp:475] [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96383, OpType=_ALLGATHER_BASE, NumelIn=88200, NumelOut=5644800, Timeout(ms)=7200000) ran for 7200520 milliseconds before timing out.
t-20240517230118-grg2t-worker-1:5123:5271 [0] NCCL INFO [Service thread] Connection closed by localRank 7
t-20240517230118-grg2t-worker-1:5125:5269 [2] NCCL INFO [Service thread] Connection closed by localRank 7
t-20240517230118-grg2t-worker-1:5127:5272 [4] NCCL INFO [Service thread] Connection closed by localRank 7
t-20240517230118-grg2t-worker-1:5129:5270 [6] NCCL INFO [Service thread] Connection closed by localRank 7
t-20240517230118-grg2t-worker-1:5130:5275 [7] NCCL INFO [Service thread] Connection closed by localRank 7
t-20240517230118-grg2t-worker-1:5130:5206 [0] NCCL INFO comm 0x738ea950 rank 15 nranks 64 cudaDev 7 busId e4000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 15] NCCL watchdog thread terminated with exception: [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96383, OpType=_ALLGATHER_BASE, NumelIn=88200, NumelOut=5644800, Timeout(ms)=7200000) ran for 7200520 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'

Coobiw commented on June 2, 2024

Hi, I also tested this on one node (8 x A100) with a 9B model. The hang appeared again. TAT

tjruwase commented on June 2, 2024

Another cause of a hang like this is if the prompt length or the generation length differs across the GPUs. This is because ZeRO inference is a data-parallel algorithm, so all ranks must take part in the same collectives.
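
If uneven lengths turn out to be the cause, one commonly used mitigation in Hugging Face transformers is synced_gpus=True, which keeps every rank stepping through generate until the longest sequence on any rank has finished. A minimal sketch (tokenizer, model, and prompts are placeholders, not from this issue):

import torch

# Uneven generation lengths desynchronize the per-layer parameter all-gathers
# that ZeRO-3 performs during forward. synced_gpus=True keeps all ranks
# looping inside generate() until the longest sequence finishes, so the
# collectives stay matched across ranks.
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        synced_gpus=True,
    )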

Coobiw commented on June 2, 2024

Oh, thanks, I get it. Do you have any suggestions about this? I think I've already done left-padding; how can I ensure the output length is the same across ranks?

tjruwase commented on June 2, 2024

@Coobiw, I think we need to first confirm that different prompt/generation lengths are responsible. Can you force all the ranks to process the exact same prompt?
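
A hedged sketch of that check, with a hard-coded prompt and a pinned generation length so that both variables are ruled out (tokenizer and model are placeholders, not from this issue):

import torch

# Every rank tokenizes the same literal prompt, so the prompt length is
# identical everywhere; min/max_new_tokens pins the generation length too.
# If the hang persists with this, uneven lengths are not the cause.
fixed_prompt = "Describe ZeRO stage 3 in one sentence."
inputs = tokenizer(fixed_prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        min_new_tokens=32,
        max_new_tokens=32,
        synced_gpus=True,
    )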
