Comments (8)
I also encountered the same situation (after I modified some parts of the model). The hang usually happens in the 4th or 5th epoch: one GPU sits at 0% utilization while all the others are at 100%. In my case, it is because the loss becomes NaN in one GPU process and that process exits automatically, which leaves the other GPU processes waiting for it.
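For reference, here is a minimal standalone sketch (not the DeiT code itself, just an illustration) of why this hangs: the rank that sees the NaN exits on its own, and the remaining ranks block forever in their next collective call, which shows up as 100% GPU utilization. It assumes 2+ GPUs and the usual torchrun / torch.distributed.launch environment variables (RANK, LOCAL_RANK, WORLD_SIZE).
# Sketch of the hang, not DeiT code: run with torchrun on 2+ GPUs.
import os
import sys
import torch
import torch.distributed as dist

def main():
    rank = int(os.environ["RANK"])
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    dist.init_process_group(backend="nccl")

    # Pretend rank 0 computed a NaN loss and every other rank computed a finite one.
    loss = torch.tensor(float("nan") if rank == 0 else 1.0, device="cuda")

    if not torch.isfinite(loss):
        # Only this rank exits; the other ranks never learn about it ...
        print("[RANK:{}] Loss is {}, stopping training".format(rank, loss.item()))
        sys.exit(1)

    # ... and wait here forever for the missing rank (the GPUs spin at 100%).
    dist.all_reduce(loss)
    dist.destroy_process_group()

if __name__ == "__main__":
    main()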
Currently, log output is only enabled for the main GPU process (set in utils.py, setup_for_distributed()), so messages from the other GPU processes are suppressed. I would suggest enabling log output for all processes. For example:
- Add setup_for_distributed2() in utils.py:
def setup_for_distributed2(rank):
    # Like setup_for_distributed(), but instead of silencing non-master processes,
    # prefix every print with the rank of the process that produced it.
    import builtins as __builtin__
    builtin_print = __builtin__.print

    def print(*args, **kwargs):
        builtin_print('[RANK:{}]'.format(rank), *args, **kwargs)

    __builtin__.print = print
- Modify the last line of init_distributed_mode() in utils.py:
def init_distributed_mode(args):
    if 'RANK' in os.environ and 'WORLD_SIZE' in os.environ:
        args.rank = int(os.environ["RANK"])
        args.world_size = int(os.environ['WORLD_SIZE'])
        args.gpu = int(os.environ['LOCAL_RANK'])
    elif 'SLURM_PROCID' in os.environ:
        args.rank = int(os.environ['SLURM_PROCID'])
        args.gpu = args.rank % torch.cuda.device_count()
    else:
        print('Not using distributed mode')
        args.distributed = False
        return

    args.distributed = True

    torch.cuda.set_device(args.gpu)
    args.dist_backend = 'nccl'
    print('| distributed init (rank {}): {}'.format(
        args.rank, args.dist_url), flush=True)
    torch.distributed.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
                                         world_size=args.world_size, rank=args.rank)
    torch.distributed.barrier()
    # setup_for_distributed(args.rank == 0)
    setup_for_distributed2(args.rank)
In my case, it outputs:
....
[RANK:0] Epoch: [5] [ 40/625] eta: 0:06:16 lr: 0.001600 loss: 6.4567 (6.3515) time: 0.5932 data: 0.0003 max mem: 8691
[RANK:2] Loss is nan, stopping training
Hi,
Thanks for trying out DeiT.
Given that all other GPUs seem to be spinning at 100%, I suspect that there was a deadlock happening somewhere.
Can you confirm which PyTorch version you are using? Debugging deadlocks in distributed training is normally pretty hard, and I can't reproduce the failure here.
Thanks
I use PyTorch 1.7.0, torchvision 0.8.1, and timm 0.3.2.
Thanks.
So here is what I generally do when I debug such errors:
- find the process id of the job (via htop for example)
- attach gdb to the process via gdb attach <pid>
- get a backtrace with thread apply all bt
You should get a big stack trace; can you paste it here?
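If attaching gdb is awkward (for example inside a container without gdb), a Python-level alternative I sometimes use, which is not something DeiT ships, is to register faulthandler on a signal early in main(), so you can dump every thread's Python stack from the stuck process with kill -USR1 <pid>:
# Hypothetical helper, not part of DeiT: call install_stack_dumper() near the top of main().
import faulthandler
import signal

def install_stack_dumper():
    # Unix-only: on SIGUSR1, write the Python traceback of every thread to stderr.
    faulthandler.register(signal.SIGUSR1, all_threads=True)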
I also encounter the same problem after modifying the model. Training DeiT with patch size 32 may reproduce it. In addition to logging errors from all processes, automatically shutting down all the other processes when one fails would be great.
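A possible way to get that behavior (just a sketch on my side, not something DeiT currently does) is to make the finite-loss check a collective decision, so every rank stops together instead of one rank exiting while the others wait in an all-reduce. Here, loss_value and device are placeholders for whatever the training loop already has:
# Sketch: agree across ranks whether the loss is finite before deciding to stop.
import math
import sys
import torch
import torch.distributed as dist

def loss_is_finite_everywhere(loss_value, device):
    flag = torch.tensor(1.0 if math.isfinite(loss_value) else 0.0, device=device)
    # MIN over ranks: becomes 0.0 if any rank saw a NaN/Inf loss.
    dist.all_reduce(flag, op=dist.ReduceOp.MIN)
    return flag.item() > 0

# Hypothetical usage inside the training loop:
# if not loss_is_finite_everywhere(loss.item(), device):
#     print("Loss is not finite on some rank, stopping all processes")
#     sys.exit(1)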
Hi,
I also encounter the same error when running DDP: one GPU ends up in a deadlock. I am not sure what the reason is.
Hi @pawopawo, I see you closed this issue; have you solved it? In my case, I trained DeiT-base and ran into the same error. Would you please share your workaround?
Set a new random seed for the DistributedSampler.
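If it helps, here is how I read that workaround (a sketch with illustrative names, not a quote of the DeiT code): pass a fresh base seed to the DistributedSampler and keep calling set_epoch() every epoch, so all ranks draw the same, newly seeded shuffle:
# Sketch: build the training sampler with an explicit (new) seed; names are illustrative.
import torch

def build_train_sampler(dataset_train, num_tasks, global_rank, base_seed):
    return torch.utils.data.DistributedSampler(
        dataset_train,
        num_replicas=num_tasks,
        rank=global_rank,
        shuffle=True,
        seed=base_seed,  # changing this value reshuffles the data differently
    )

# Each epoch, before iterating, keep the usual call so epochs differ:
# sampler_train.set_epoch(epoch)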