Distributed training cannot be run on a single GPU; it fails with error: unrecognized arguments: --local-rank=0 ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 19068) of binary about unimatch CLOSED

liheyoung commented on September 24, 2024
The distributed training setup cannot be run on a single GPU. It fails with error: unrecognized arguments: --local-rank=0 ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 19068) of binary

from unimatch.

Comments (6)

LiheYoung commented on September 24, 2024

Please check your PyTorch version. It may be 2.0, in which case the launcher passes --local-rank to the training script by default, instead of the --local_rank argument our script expects. You can use PyTorch 1.12 or 1.13.

Update based on feedback from @SDivakarBhat: PyTorch 1.12 is required for our local_rank argument. With PyTorch 1.13, you may need to change local_rank to local-rank in the training script's arguments.
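
As an alternative to pinning a PyTorch version, the argument parser can be made to accept both spellings. The following is a minimal sketch, not the repository's actual code: `build_parser` is a hypothetical helper, and the fallback to the `LOCAL_RANK` environment variable assumes a launcher that sets it.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser that tolerates both launcher spellings of the flag."""
    parser = argparse.ArgumentParser(description="training entry point (illustrative)")
    # Both option strings map to the same destination, args.local_rank.
    # The default falls back to the LOCAL_RANK environment variable if it is set.
    parser.add_argument(
        "--local_rank", "--local-rank",
        dest="local_rank",
        type=int,
        default=int(os.environ.get("LOCAL_RANK", 0)),
    )
    return parser


# Either form parses to the same attribute.
args_dash = build_parser().parse_args(["--local-rank=0"])
args_underscore = build_parser().parse_args(["--local_rank", "3"])
print(args_dash.local_rank, args_underscore.local_rank)
```

With a parser like this, the "unrecognized arguments: --local-rank=0" failure should not depend on which spelling the installed launcher emits.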


SDivakarBhat commented on September 24, 2024

First of all, thank you for providing access to such great work!
I am facing the same issue even though my PyTorch version is 1.13. I have also tried running it with version 1.12.
When I change the loss for labeled set (criterion_l) to CELoss the code starts running for a couple of epochs but then fails again with the same issue.
When criterion_l is not calculated, the code runs and finishes smoothly (this was done only for debugging, since it obviously makes the results useless).
It would be great if this could be resolved.


LiheYoung commented on September 24, 2024

@SDivakarBhat Hi, thank you for your feedback. Could you clarify what you mean by "When I change the loss for labeled set (criterion_l) to CELoss the code starts running for a couple of epochs but then fails again with the same issue."? We already use CELoss as the labeled loss by default on Pascal. Do you mean changing the OHEM loss to CELoss on Cityscapes?

Can you also provide more details about "the same issue"? Do you mean the local-rank error? That would be very strange, because this problem can only appear at the very start, when launching the script.


SDivakarBhat commented on September 24, 2024

Hi, thank you for your quick response.
Yes, I am using Cityscapes. I think my error is not due to the local-rank problem; it seems to appear randomly between epochs. Below is part of the error. It would be a great help if you could shed some light on the possible reason.

  File "unimatch.py", line 267, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "unimatch.py", line 190, in main
    loss_u_s1 = loss_u_s1.sum() / (ignore_mask_cutmixed1 != 255).sum().item()
RuntimeError: CUDA error: an illegal memory access was encountered


LiheYoung commented on September 24, 2024

In most cases, this happens because your ground-truth masks contain values outside the range of classes your model outputs. I think your labeled masks are incorrect, since the script finishes when the labeled loss is removed. You can use torch.unique or np.unique to check the GT masks. By the way, are you using our provided Cityscapes masks or the official masks?
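
The np.unique check suggested above could look roughly like the sketch below. `find_invalid_labels` is a hypothetical helper, not part of the repository; the ignore value 255 is taken from the traceback in this thread, and `num_classes=19` is an assumption for a Cityscapes-style setup. A toy array stands in for a mask loaded from disk.

```python
import numpy as np


def find_invalid_labels(mask: np.ndarray, num_classes: int, ignore_index: int = 255) -> list:
    """Return label values that are neither valid class ids nor the ignore index."""
    values = np.unique(mask)
    # Valid class ids are 0 .. num_classes - 1.
    valid = (values >= 0) & (values < num_classes)
    return [int(v) for v in values[~valid & (values != ignore_index)]]


# Toy mask: 0, 7 and 18 are valid for 19 classes, 255 is ignored, 33 is out of range.
toy_mask = np.array([[0, 7, 18], [255, 33, 7]], dtype=np.uint8)
invalid = find_invalid_labels(toy_mask, num_classes=19)
print(invalid)  # [33]
```

A non-empty result for any training mask would be consistent with the illegal memory access reported above, since the loss would then index past the model's class dimension.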


SDivakarBhat commented on September 24, 2024

Thank you for the response. I am using the official masks. I think I have resolved the issue.

