
multi node training about deit (closed, 8 comments)

facebookresearch commented on August 15, 2024
multi node training


Comments (8)

xwjabc commented on August 15, 2024

I also encounter the same situation (after I modified some parts of the model). The hang usually happens in the 4th or 5th epoch: one GPU sits at 0% utilization while all the others are at 100%. In my case, the cause is that the loss becomes NaN in one GPU process, which then exits automatically, leaving the other GPU processes waiting for it.

Currently, log output is only enabled for the main GPU process (set in utils.py via setup_for_distributed()), so messages from the other GPU processes are suppressed. I would suggest enabling log output for all processes. For example:

  1. Add setup_for_distributed2() in utils.py:
def setup_for_distributed2(rank):
    # Instead of silencing non-master ranks, prefix every print() with the process rank.
    import builtins as __builtin__
    builtin_print = __builtin__.print

    def print(*args, **kwargs):
        builtin_print('[RANK:{}]'.format(rank), *args, **kwargs)

    __builtin__.print = print
  2. Modify the last line of init_distributed_mode() in utils.py:
def init_distributed_mode(args):
    if 'RANK' in os.environ and 'WORLD_SIZE' in os.environ:
        args.rank = int(os.environ["RANK"])
        args.world_size = int(os.environ['WORLD_SIZE'])
        args.gpu = int(os.environ['LOCAL_RANK'])
    elif 'SLURM_PROCID' in os.environ:
        args.rank = int(os.environ['SLURM_PROCID'])
        args.gpu = args.rank % torch.cuda.device_count()
    else:
        print('Not using distributed mode')
        args.distributed = False
        return

    args.distributed = True

    torch.cuda.set_device(args.gpu)
    args.dist_backend = 'nccl'
    print('| distributed init (rank {}): {}'.format(
        args.rank, args.dist_url), flush=True)
    torch.distributed.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
                                         world_size=args.world_size, rank=args.rank)
    torch.distributed.barrier()
    # setup_for_distributed(args.rank == 0)  # original behaviour: only rank 0 prints
    setup_for_distributed2(args.rank)        # changed: every rank prints, tagged with its rank

In my case, it outputs:

....
[RANK:0] Epoch: [5]  [ 40/625]  eta: 0:06:16  lr: 0.001600  loss: 6.4567 (6.3515)  time: 0.5932  data: 0.0003  max mem: 8691
[RANK:2] Loss is nan, stopping training


fmassa commented on August 15, 2024

Hi,

Thanks for trying out DeiT.

Given that all other GPUs seem to be spinning at 100%, I suspect that there was a deadlock happening somewhere.

Can you confirm which PyTorch version you are using? Debugging deadlocks in distributed training is normally pretty hard, and I can't reproduce the failure here.


pawopawo commented on August 15, 2024

Thanks

I use PyTorch 1.7.0, torchvision 0.8.1, and timm 0.3.2.


fmassa commented on August 15, 2024

Thanks.

So here is what I generally do when I debug such errors:

  • find the process id of the job (via htop, for example)
  • attach gdb to the process via gdb attach <pid>
  • get a backtrace with thread apply all bt

You should get a fairly big stack trace; can you paste it here?
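
[Editor's note] As a Python-level complement to the gdb backtrace above (a sketch, not something from this thread's code; enable_stack_dumps is a hypothetical helper), the standard faulthandler module can be registered in each worker so that a signal sent to a hung PID prints its Python traceback:

# Sketch only: make each rank dump its Python-level stack traces on demand.
import faulthandler
import os
import signal
import sys

def enable_stack_dumps(rank):
    sys.stderr.write('[RANK:{}] PID {} will dump stacks on SIGUSR1\n'.format(rank, os.getpid()))
    # After this, `kill -USR1 <pid>` prints the traceback of every thread in this process.
    faulthandler.register(signal.SIGUSR1, all_threads=True)

Calling it once per process, e.g. right after init_distributed_mode(), would make it easy to see which collective each healthy rank is blocked in.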


changlin31 commented on August 15, 2024

I also encountered the same problem after modifying the model. Training DeiT with patch size 32 may reproduce it. In addition to logging errors from all processes, automatically shutting down all the other processes when one of them fails would be great (see the sketch below).
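
[Editor's note] One way to get that behaviour (a sketch, not DeiT's actual code) is to have every rank agree on whether the loss is finite before deciding to stop, so all ranks exit together instead of one rank dying while the rest block in a collective:

# Sketch only: cross-rank check so every process stops together when any loss is NaN/Inf.
import math
import sys
import torch
import torch.distributed as dist

def loss_is_finite_everywhere(loss_value, device):
    # 1.0 if this rank's loss is finite, 0.0 otherwise.
    flag = torch.tensor(1.0 if math.isfinite(loss_value) else 0.0, device=device)
    # MIN over all ranks: becomes 0.0 as soon as any single rank saw a bad loss.
    dist.all_reduce(flag, op=dist.ReduceOp.MIN)
    return flag.item() == 1.0

# Inside the training loop (replacing a check that exits on one rank only):
# if not loss_is_finite_everywhere(loss.item(), device):
#     print('Loss is not finite on some rank, stopping training')
#     sys.exit(1)

Because the all_reduce is itself a collective that every rank calls each step, no rank can exit without the others noticing.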


zhoudaquan commented on August 15, 2024

Hi,

I also encounter the same error when running DDP: one GPU ends up in a deadlock. I am not sure what the reason is.


HubHop commented on August 15, 2024

Hi @pawopawo, I see you closed this issue. Have you solved it? In my case, I trained DeiT-base and also ran into this error. Would you please share your workaround?


pawopawo commented on August 15, 2024

Set a new random seed for the DistributedSampler.
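
[Editor's note] A minimal sketch of that workaround, assuming the plain torch.utils.data API rather than the exact DeiT training script: pass a new base seed to DistributedSampler (the seed argument exists in PyTorch 1.7.0, the version used above) and keep calling set_epoch() so each epoch reshuffles differently.

# Sketch only: vary the shuffle seed used by DistributedSampler.
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(1000))          # stand-in for the real dataset
world_size, rank = 1, 0                              # normally taken from the distributed init

# `seed` changes the base seed of the per-epoch shuffle; picking a new value
# is the "new random seed" part of the workaround.
sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank,
                             shuffle=True, seed=1234)
loader = DataLoader(dataset, sampler=sampler, batch_size=64)

for epoch in range(3):
    sampler.set_epoch(epoch)   # combined with `seed`, gives a fresh permutation per epoch
    for (batch,) in loader:
        pass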

