
I got an error when I start to train (ml-gmpi) CLOSED

apple commented on August 20, 2024
I got an error when I start to train.

from ml-gmpi.

Comments (9)

Xiaoming-Zhao commented on August 20, 2024

Thanks for your information. Several questions:

  1. Do you mind posting the full traceback? I need it to see the exact message. In your original message, the traceback ends at warnings.warn('resource_tracker: There appear to be %d, and I guess something else follows it. So please paste the entire traceback if possible.

  2. For the following

But I also tried the second solution, and it seems to have solved the issue because it moved on to the next error

What is the new error message?

  3. For the following

However, when I run the same command NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=5,6 python launch.py --run_dataset FFHQ256 --nproc_per_node 2 --task-type gmpi --run-type train master_port 8378, the same error sometimes occurs.

Since the NCCL_P2P_DISABLE trick works, this is most likely an NVLink issue. In that case, I guess you need to set CUDA_VISIBLE_DEVICES to a pair that starts with an even number, e.g., 0,1 or 2,3 or 4,5 or 6,7. NVLink has a somewhat odd setup in which a pair starting with an odd number may fail.
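
The even-start pairing suggested above can be enumerated with a short script. This is only a sketch: the helper name even_start_pairs is hypothetical and not part of ml-gmpi, and whether a given pair actually avoids the failure depends on the NVLink topology of the server.

```python
from typing import List


def even_start_pairs(num_gpus: int) -> List[str]:
    """Return CUDA_VISIBLE_DEVICES values for adjacent GPU pairs
    whose first index is even (0,1 then 2,3 and so on)."""
    return [f"{i},{i + 1}" for i in range(0, num_gpus - 1, 2)]


if __name__ == "__main__":
    # Assumes an 8-GPU server with devices numbered 0 to 7.
    for pair in even_start_pairs(8):
        print(f"CUDA_VISIBLE_DEVICES={pair}")
```

Each printed value can then be placed in front of the launch command in place of 5,6.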

Xiaoming-Zhao commented on August 20, 2024

Thanks a lot for the clarification.

Regarding the pyspng issue, it may be caused by the package itself rather than the files, since all files are PNGs. If this is the case, this may help: #1 (comment)

Let me know if you encounter any new issues.
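
To rule out the "file is not really a PNG" hypothesis before blaming the package, one option is to check the 8-byte PNG signature of every file. This is a minimal sketch using only the standard library; the helper names is_png and find_non_png are hypothetical, and the dataset directory is something you would supply yourself.

```python
from pathlib import Path

# Fixed 8-byte header that every valid PNG file starts with.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def is_png(path) -> bool:
    """Return True if the file starts with the PNG signature."""
    with open(path, "rb") as f:
        return f.read(8) == PNG_SIGNATURE


def find_non_png(root):
    """List files under root that end in .png but lack the PNG signature."""
    return sorted(
        p for p in Path(root).rglob("*.png")
        if p.is_file() and not is_png(p)
    )
```

An empty result from find_non_png on the dataset folder would suggest the files are fine and the problem lies elsewhere, for example in the installed pyspng build. Note that a correct signature does not guarantee the rest of the file is undamaged.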

Xiaoming-Zhao commented on August 20, 2024

Do you mind posting the full traceback? Could you also share your running environment, e.g., PyTorch version, CUDA version, OS version, and GPU type?
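
One way to gather the environment details asked for above is a small script like the following. It is a sketch, not something shipped with ml-gmpi: collect_env is a hypothetical helper name, and it reports the CUDA version PyTorch was built against, which may differ from the driver version shown by nvidia-smi.

```python
import platform
import sys


def collect_env() -> dict:
    """Collect OS, Python, PyTorch, CUDA, and GPU information."""
    info = {"os": platform.platform(), "python": sys.version.split()[0]}
    try:
        import torch
    except ImportError:
        # PyTorch is missing in this interpreter; report that and stop.
        info["torch"] = "not installed"
        return info
    info["torch"] = torch.__version__
    info["cuda (torch build)"] = torch.version.cuda
    if torch.cuda.is_available():
        info["gpus"] = [
            torch.cuda.get_device_name(i)
            for i in range(torch.cuda.device_count())
        ]
    else:
        info["gpus"] = []
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

PyTorch also ships a more thorough report via python -m torch.utils.collect_env, whose output can be pasted into an issue directly.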

Thanks

Xiaoming-Zhao commented on August 20, 2024

BTW, is --nproc_per_node 0 a typo, or did you indeed call the command with it?

Based on the context, it seems like it should be --nproc_per_node 5 since you have CUDA_VISIBLE_DEVICES=2,3,4,5,6. Maybe this is the culprit.
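
A quick sanity check for this kind of mismatch could look like the sketch below. The helper names are hypothetical, and the rule it enforces (at least one process, and no more processes than visible GPUs) is an assumption about a sensible setup rather than a documented requirement of launch.py.

```python
def visible_device_count(cuda_visible_devices: str) -> int:
    """Count the device IDs in a CUDA_VISIBLE_DEVICES string such as '2,3,4,5,6'."""
    return len([d for d in cuda_visible_devices.split(",") if d.strip()])


def check_nproc(cuda_visible_devices: str, nproc_per_node: int) -> None:
    """Raise ValueError if nproc_per_node is below 1 or exceeds the visible GPUs."""
    available = visible_device_count(cuda_visible_devices)
    if nproc_per_node < 1:
        raise ValueError("--nproc_per_node must be at least 1")
    if nproc_per_node > available:
        raise ValueError(
            f"--nproc_per_node {nproc_per_node} exceeds the "
            f"{available} visible device(s)"
        )
```

With CUDA_VISIBLE_DEVICES=2,3,4,5,6 this reports five visible devices, so a value of 0 is rejected while any value from 1 to 5 passes.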

parkjh688 commented on August 20, 2024

Oh sorry, that was the wrong command. I've always used CUDA_VISIBLE_DEVICES=2,3,4,5,6 python launch.py --run_dataset FFHQ256 --nproc_per_node 2 --task-type gmpi --run-type train master_port 837

And now I'm using Docker.

  • torch version: 1.12.1
  • CUDA Version: 11.4
  • GPU type : NVIDIA A100
  • gcc version : 7.5.0

Xiaoming-Zhao commented on August 20, 2024

Thanks for the clarification. Do you mind posting the full traceback then?

Besides, to help debug, do you mind trying the following:

  1. Does it work if you only use one GPU, i.e., set --nproc_per_node 1?
  2. If the above works, how about having NCCL_P2P_DISABLE=1, i.e., NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=2,3,4,5,6 python launch.py ...?

Reasoning: if I remember correctly, servers with NVLink can have issues with DDP, so I first want to see whether single-GPU training works. If it does, step 2 is the trick I used before to resolve the issue.
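
The two debugging steps can also be scripted so that the environment variables are set explicitly for the child process. The sketch below assumes the launch arguments quoted in this thread; launch_env is a hypothetical helper, not part of ml-gmpi.

```python
import os


def launch_env(gpus: str, disable_p2p: bool = True) -> dict:
    """Copy the current environment and set the GPU-related variables.

    gpus is a CUDA_VISIBLE_DEVICES string such as "2,3".
    """
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = gpus
    if disable_p2p:
        # Ask NCCL not to use direct GPU-to-GPU (peer-to-peer) transfers.
        env["NCCL_P2P_DISABLE"] = "1"
    else:
        env.pop("NCCL_P2P_DISABLE", None)
    return env


# Example usage (arguments copied from the command in this thread):
#   import subprocess
#   cmd = ["python", "launch.py", "--run_dataset", "FFHQ256",
#          "--nproc_per_node", "1", "--task-type", "gmpi",
#          "--run-type", "train"]
#   subprocess.run(cmd, env=launch_env("2", disable_p2p=False), check=True)
```

Running once with a single GPU and disable_p2p=False, then with two GPUs and disable_p2p=True, mirrors steps 1 and 2 above.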

parkjh688 commented on August 20, 2024

I tried the first one already but it didn't work.

But I also tried the second solution, and it seems to have solved the issue because it moved on to the next error. However, when I run the same command NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=5,6 python launch.py --run_dataset FFHQ256 --nproc_per_node 2 --task-type gmpi --run-type train master_port 8378, the same error sometimes occurs.

parkjh688 commented on August 20, 2024

Oh, I wanted to show the full trace, but the error output ends at warnings.warn('resource_tracker: There appear to be %d.

The new error is raised by X = pyspng.load(f.read()) in dataset.py.
Maybe the image file is not a PNG, or there is a problem with the path.

I didn't know that. Thanks! I have 8 GPUs, numbered 0 to 7, so I'll try various GPU combinations.

parkjh688 commented on August 20, 2024

Thanks!
