Distributed training fails on a single GPU with "error: unrecognized arguments: --local-rank=0 ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 19068) of binary" (unimatch issue, closed)

liheyoung commented on September 24, 2024

The distributed training setup cannot run on a single GPU. It fails with the following error: error: unrecognized arguments: --local-rank=0 ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 19068) of binary

from unimatch.

Comments (6)

LiheYoung commented on September 24, 2024

Please check your PyTorch version. If it is 2.0, the launcher passes --local-rank to the training script by default, rather than the local_rank argument our script expects. You can use PyTorch 1.12 or 1.13 instead.

Update based on feedback from @SDivakarBhat: PyTorch 1.12 is required for our local_rank argument. With PyTorch 1.13, you may need to change local_rank to local-rank in the training script's arguments.
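
As a possible alternative to pinning the PyTorch version, the argument parser can be made to accept both spellings. The following is a minimal sketch, not the repository's actual code: the function names `build_parser` and `parse_local_rank` are hypothetical, and it assumes the training script reads the rank through `argparse`. It also falls back to the `LOCAL_RANK` environment variable, which `torchrun` sets for each worker.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser that tolerates both launcher spellings."""
    parser = argparse.ArgumentParser(description="launcher-compatible arguments (sketch)")
    # Register both option strings on a single argument. The value is stored
    # in args.local_rank regardless of which spelling the launcher passes.
    parser.add_argument(
        "--local_rank", "--local-rank",
        dest="local_rank",
        type=int,
        # Fall back to the LOCAL_RANK environment variable if no flag is given.
        default=int(os.environ.get("LOCAL_RANK", 0)),
    )
    return parser


def parse_local_rank(argv=None) -> int:
    # parse_known_args ignores unrelated flags, so this helper can be
    # exercised on its own without the script's other arguments.
    args, _unknown = build_parser().parse_known_args(argv)
    return args.local_rank
```

With this in place, both `--local-rank=0` and `--local_rank 0` should resolve to `args.local_rank == 0`.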

SDivakarBhat commented on September 24, 2024

> Please check your PyTorch version. If it is 2.0, the launcher passes --local-rank to the training script by default, rather than the local_rank argument our script expects. You can use PyTorch 1.12 or 1.13 instead.

First of all, thank you for making such great work available!
I am facing the same issue even though my PyTorch version is 1.13. I have also tried running it with version 1.12.
When I change the loss for the labeled set (criterion_l) to CELoss, the code runs for a couple of epochs but then fails again with the same issue.
When criterion_l is not calculated at all, the code runs and finishes smoothly. (This was done purely for debugging, since it obviously makes the results useless.)
It would be great if this could be resolved.

LiheYoung commented on September 24, 2024

@SDivakarBhat Hi, thank you for your feedback. Could you explain what you mean by "When I change the loss for labeled set (criterion_l) to CELoss the code starts running for a couple of epochs but then fails again with the same issue"? We already use CELoss as the labeled loss by default on Pascal. Do you mean changing the OHEM loss to CELoss on Cityscapes?

Also, can you provide more details about "the same issue"? Do you mean the local-rank error? That would be very strange, because that problem should only appear at the very start, when launching the script.

SDivakarBhat commented on September 24, 2024

> @SDivakarBhat Hi, thank you for your feedback. Could you explain what you mean by "When I change the loss for labeled set (criterion_l) to CELoss the code starts running for a couple of epochs but then fails again with the same issue"? We already use CELoss as the labeled loss by default on Pascal. Do you mean changing the OHEM loss to CELoss on Cityscapes?
>
> Also, can you provide more details about "the same issue"? Do you mean the local-rank error? That would be very strange, because that problem should only appear at the very start, when launching the script.

Hi, thank you for your quick response.
Yes, I am using Cityscapes. I think my error is not caused by the local-rank problem; it seems to appear randomly between epochs. Below is part of the error. It would be a great help if you could shed some light on the possible cause.

  File "unimatch.py", line 267, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "unimatch.py", line 190, in main
    loss_u_s1 = loss_u_s1.sum() / (ignore_mask_cutmixed1 != 255).sum().item()
RuntimeError: CUDA error: an illegal memory access was encountered

LiheYoung commented on September 24, 2024

In most cases, this happens because your ground-truth masks contain values that are larger than your model's output dimension (the number of classes). I think your labeled masks are incorrect, since the script finishes when the labeled loss is removed. You can use torch.unique or np.unique to check the GT masks. By the way, are you using our provided Cityscapes masks or the official masks?
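
The np.unique check suggested above could look roughly like the sketch below. The helper name `invalid_label_values` is hypothetical, and the number of classes must be supplied by the caller; the ignore value 255 is taken from the mask comparison in the traceback. It returns every label value that is neither a valid class index nor the ignore value.

```python
import numpy as np

IGNORE_INDEX = 255  # ignore value used in the mask comparison in the traceback


def invalid_label_values(mask, num_classes, ignore_index=IGNORE_INDEX):
    """Return the sorted label values in `mask` that the loss cannot handle.

    A value is considered invalid if it is outside [0, num_classes) and is
    not the ignore index. `num_classes` is an assumption supplied by the
    caller and should match the model's output dimension.
    """
    values = np.unique(np.asarray(mask))
    out_of_range = (values < 0) | (values >= num_classes)
    bad = values[out_of_range & (values != ignore_index)]
    return bad.tolist()


if __name__ == "__main__":
    # Toy mask: 19 and 33 are out of range for a 19-class model.
    toy_mask = np.array([[0, 18, 255], [7, 33, 19]], dtype=np.uint8)
    print(invalid_label_values(toy_mask, num_classes=19))
```

Running this over every labeled mask in the training list and reporting any non-empty result should point to the offending files. For example, if the masks store raw Cityscapes label IDs rather than 19-class train IDs, values above 18 would show up here; whether that was the cause in this thread is not stated.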

SDivakarBhat commented on September 24, 2024

> In most cases, this happens because your ground-truth masks contain values that are larger than your model's output dimension (the number of classes). I think your labeled masks are incorrect, since the script finishes when the labeled loss is removed. You can use torch.unique or np.unique to check the GT masks. By the way, are you using our provided Cityscapes masks or the official masks?

Thank you for the response. I am using the official masks. I think I have resolved the issue.
