
multi node training about deit (closed, 8 comments)

facebookresearch commented on August 15, 2024
multi node training


Comments (8)

xwjabc commented on August 15, 2024

I also encounter the same situation (after I modified some parts of the model). The hang usually happens in the 4th or 5th epoch: one GPU sits at 0% utilization while all the others are at 100%. In my case, the cause is that the loss becomes NaN in one GPU process, which then exits automatically, leaving the other GPU processes waiting for it.

Currently, log output is only enabled for the main GPU process (set in utils.py via setup_for_distributed()), so messages from the other GPU processes are suppressed. I would suggest enabling log output for all processes. For example:

  1. Add setup_for_distributed2() in utils.py:
def setup_for_distributed2(rank):
    # Instead of silencing non-master ranks, prefix every print() with the process rank.
    import builtins as __builtin__
    builtin_print = __builtin__.print

    def print(*args, **kwargs):
        builtin_print('[RANK:{}]'.format(rank), *args, **kwargs)

    __builtin__.print = print
  2. Modify the last line of init_distributed_mode() in utils.py:
def init_distributed_mode(args):
    if 'RANK' in os.environ and 'WORLD_SIZE' in os.environ:
        args.rank = int(os.environ["RANK"])
        args.world_size = int(os.environ['WORLD_SIZE'])
        args.gpu = int(os.environ['LOCAL_RANK'])
    elif 'SLURM_PROCID' in os.environ:
        args.rank = int(os.environ['SLURM_PROCID'])
        args.gpu = args.rank % torch.cuda.device_count()
    else:
        print('Not using distributed mode')
        args.distributed = False
        return

    args.distributed = True

    torch.cuda.set_device(args.gpu)
    args.dist_backend = 'nccl'
    print('| distributed init (rank {}): {}'.format(
        args.rank, args.dist_url), flush=True)
    torch.distributed.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
                                         world_size=args.world_size, rank=args.rank)
    torch.distributed.barrier()
    # setup_for_distributed(args.rank == 0)  # original behaviour: only rank 0 prints
    setup_for_distributed2(args.rank)        # changed: every rank prints, tagged with its rank

In my case, it outputs:

....
[RANK:0] Epoch: [5]  [ 40/625]  eta: 0:06:16  lr: 0.001600  loss: 6.4567 (6.3515)  time: 0.5932  data: 0.0003  max mem: 8691
[RANK:2] Loss is nan, stopping training


fmassa commented on August 15, 2024

Hi,

Thanks for trying out DeiT.

Given that all other GPUs seem to be spinning at 100%, I suspect that there was a deadlock happening somewhere.

Can you confirm which PyTorch version you are using? Debugging deadlocks in distributed training is normally pretty hard, and I can't reproduce the failure here.


pawopawo commented on August 15, 2024

Thanks

I use PyTorch 1.7.0, torchvision 0.8.1, and timm 0.3.2.


fmassa commented on August 15, 2024

Thanks.

So here is what I generally do when I debug such errors:

  • find the process id of the job (via htop, for example)
  • attach gdb to the process via gdb attach <pid>
  • get a backtrace with thread apply all bt

You should get a fairly big stack trace; can you paste it here?
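
[Editor's note] As a Python-level complement to the gdb backtrace above (a sketch, not something from this thread's code; enable_stack_dumps is a hypothetical helper), the standard faulthandler module can be registered in each worker so that a signal sent to a hung PID prints its Python traceback:

# Sketch only: make each rank dump its Python-level stack traces on demand.
import faulthandler
import os
import signal
import sys

def enable_stack_dumps(rank):
    sys.stderr.write('[RANK:{}] PID {} will dump stacks on SIGUSR1\n'.format(rank, os.getpid()))
    # After this, `kill -USR1 <pid>` prints the traceback of every thread in this process.
    faulthandler.register(signal.SIGUSR1, all_threads=True)

Calling it once per process, e.g. right after init_distributed_mode(), would make it easy to see which collective each healthy rank is blocked in.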


changlin31 commented on August 15, 2024

I also encountered the same problem after modifying the model. Training DeiT with patch size 32 may reproduce it. In addition to logging errors from all processes, automatically shutting down all the other processes when one of them fails would be great (see the sketch below).
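
[Editor's note] One way to get that behaviour (a sketch, not DeiT's actual code) is to have every rank agree on whether the loss is finite before deciding to stop, so all ranks exit together instead of one rank dying while the rest block in a collective:

# Sketch only: cross-rank check so every process stops together when any loss is NaN/Inf.
import math
import sys
import torch
import torch.distributed as dist

def loss_is_finite_everywhere(loss_value, device):
    # 1.0 if this rank's loss is finite, 0.0 otherwise.
    flag = torch.tensor(1.0 if math.isfinite(loss_value) else 0.0, device=device)
    # MIN over all ranks: becomes 0.0 as soon as any single rank saw a bad loss.
    dist.all_reduce(flag, op=dist.ReduceOp.MIN)
    return flag.item() == 1.0

# Inside the training loop (replacing a check that exits on one rank only):
# if not loss_is_finite_everywhere(loss.item(), device):
#     print('Loss is not finite on some rank, stopping training')
#     sys.exit(1)

Because the all_reduce is itself a collective that every rank calls each step, no rank can exit without the others noticing.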


zhoudaquan commented on August 15, 2024

Hi,

I also encounter the same error when running DDP: one GPU ends up in a deadlock. I am not sure what the reason is.


HubHop commented on August 15, 2024

Hi @pawopawo, I see you closed this issue. Have you solved it? In my case, I trained DeiT-base and also ran into this error. Would you please share your workaround?


pawopawo commented on August 15, 2024

Set a new random seed for the DistributedSampler.
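
[Editor's note] A minimal sketch of that workaround, assuming the plain torch.utils.data API rather than the exact DeiT training script: pass a new base seed to DistributedSampler (the seed argument exists in PyTorch 1.7.0, the version used above) and keep calling set_epoch() so each epoch reshuffles differently.

# Sketch only: vary the shuffle seed used by DistributedSampler.
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(1000))          # stand-in for the real dataset
world_size, rank = 1, 0                              # normally taken from the distributed init

# `seed` changes the base seed of the per-epoch shuffle; picking a new value
# is the "new random seed" part of the workaround.
sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank,
                             shuffle=True, seed=1234)
loader = DataLoader(dataset, sampler=sampler, batch_size=64)

for epoch in range(3):
    sampler.set_epoch(epoch)   # combined with `seed`, gives a fresh permutation per epoch
    for (batch,) in loader:
        pass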

