I got error when I start to train about ml-gmpi HOT 9 CLOSED

apple commented on August 20, 2024

I got error when I start to train

from ml-gmpi.

Comments (9)

Xiaoming-Zhao commented on August 20, 2024

Thanks for your information. Several questions:

1. Do you mind posting the full traceback? I need it to see the exact message. In your original message, the traceback ended at warnings.warn('resource_tracker: There appear to be %d. I guess there is something else after it, so please paste the entire traceback if possible.

2. For the following:

"But I also tried the second solution, and it seems to have solved the issue because it moved on to the next error"

What is the new error message?

3. For the following:

"However, when I run the same command NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=5,6 python launch.py --run_dataset FFHQ256 --nproc_per_node 2 --task-type gmpi --run-type train master_port 8378, the same error sometimes occurs."

Since the NCCL_P2P_DISABLE trick works, this is very likely an NVLink issue. Regarding this, I guess you need to set CUDA_VISIBLE_DEVICES to a combination that starts with an even number, e.g., 0,1, 2,3, 4,5, or 6,7. NVLink has some quirks in its setup, so a combination starting with an odd number may fail.
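
To make the suggested pairings concrete, here is a minimal sketch that lists adjacent GPU pairs whose first index is even. The helper name even_start_pairs is made up for illustration, and the count of 8 GPUs is taken from this thread. Whether these pairs actually share an NVLink connection depends on the machine, so treat the output as candidates to try; nvidia-smi topo -m shows the real topology.

```python
def even_start_pairs(num_gpus):
    """Return 'a,b' strings for adjacent GPU pairs whose first index is even."""
    return [f"{i},{i + 1}" for i in range(0, num_gpus - 1, 2)]


# Print candidate CUDA_VISIBLE_DEVICES values for an 8-GPU server.
for pair in even_start_pairs(8):
    print(f"CUDA_VISIBLE_DEVICES={pair}")
```

Each printed value can be placed in front of the usual launch command in place of 5,6.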

Xiaoming-Zhao commented on August 20, 2024

Thanks a lot for the clarification.

Regarding the pyspng issue, it may be caused by the package itself rather than the files, since all the files are PNGs. If that is the case, this may help: #1 (comment)

Let me know if you encounter any new issues.

Xiaoming-Zhao commented on August 20, 2024

Do you mind posting the full traceback, along with your running environment (e.g., PyTorch version, CUDA version, OS version, and GPU type)?
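
A minimal sketch of how these details could be gathered in one go. The helper name collect_env is made up for illustration, and the code assumes PyTorch may or may not be importable in the current interpreter. PyTorch also ships python -m torch.utils.collect_env, which prints a more complete report.

```python
import platform
import sys


def collect_env():
    """Collect OS, Python, PyTorch, CUDA, and GPU details into a dict."""
    info = {"os": platform.platform(), "python": sys.version.split()[0]}
    try:
        import torch
    except ImportError:
        info["torch"] = "not installed"
        return info
    info["torch"] = torch.__version__
    # CUDA version PyTorch was built against (None for CPU-only builds).
    info["cuda"] = torch.version.cuda
    if torch.cuda.is_available():
        info["gpus"] = [
            torch.cuda.get_device_name(i)
            for i in range(torch.cuda.device_count())
        ]
    return info


for key, value in collect_env().items():
    print(f"{key}: {value}")
```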

Thanks

Xiaoming-Zhao commented on August 20, 2024

BTW, is --nproc_per_node 0 a typo, or did you indeed call the command with it?

Based on the context, it seems like it should be --nproc_per_node 5 since you have CUDA_VISIBLE_DEVICES=2,3,4,5,6. Maybe this is the culprit.
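
A quick sanity check along these lines can be sketched as follows. The function names are hypothetical and not part of launch.py. The check only assumes that the process count must be at least 1 and should not exceed the number of ids listed in CUDA_VISIBLE_DEVICES; using fewer processes than visible GPUs is allowed.

```python
import os


def visible_gpu_count(env=None):
    """Count the GPU ids listed in CUDA_VISIBLE_DEVICES; None if it is unset."""
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None
    return len([gpu for gpu in value.split(",") if gpu.strip()])


def nproc_problem(nproc_per_node, env=None):
    """Return a description of the mismatch, or None if the value looks sane."""
    if nproc_per_node < 1:
        return "--nproc_per_node must be at least 1"
    visible = visible_gpu_count(env)
    if visible is not None and nproc_per_node > visible:
        return f"--nproc_per_node {nproc_per_node} exceeds the {visible} visible GPU(s)"
    return None


env = {"CUDA_VISIBLE_DEVICES": "2,3,4,5,6"}
print(visible_gpu_count(env))  # 5
print(nproc_problem(0, env))   # --nproc_per_node must be at least 1
print(nproc_problem(2, env))   # None
```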

parkjh688 commented on August 20, 2024

Oh, sorry, that was the wrong command. I've always used CUDA_VISIBLE_DEVICES=2,3,4,5,6 python launch.py --run_dataset FFHQ256 --nproc_per_node 2 --task-type gmpi --run-type train master_port 837

And now I'm using Docker.

torch version: 1.12.1
CUDA version: 11.4
GPU type: NVIDIA A100
gcc version: 7.5.0

Xiaoming-Zhao commented on August 20, 2024

Thanks for the clarification. Do you mind posting the full traceback then?

Besides, to help debug, do you mind trying the following:

1. Does it work if you only use one GPU, i.e., set --nproc_per_node 1?
2. If the above works, how about setting NCCL_P2P_DISABLE=1, i.e., NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=2,3,4,5,6 python launch.py ...?

Reasons: if I remember correctly, DDP can run into issues on servers with NVLink. So I am first trying to see whether single-GPU training works. If it does, 2 is the trick I used before to resolve the issue.
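
For anyone scripting these two experiments, the environment for each can be sketched as below. The helper name build_env and the GPU ids are illustrative assumptions; NCCL_P2P_DISABLE and CUDA_VISIBLE_DEVICES are the variables already used in the commands above. The resulting dict can be passed as env= to subprocess.run together with the usual launch.py command.

```python
import os


def build_env(gpu_ids, disable_p2p=True, base_env=None):
    """Copy the environment and set the GPU selection and NCCL P2P switch."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    if disable_p2p:
        # Ask NCCL not to use direct GPU-to-GPU (peer-to-peer) transfers.
        env["NCCL_P2P_DISABLE"] = "1"
    return env


# Experiment 1: a single GPU, P2P left untouched.
single = build_env([2], disable_p2p=False, base_env={})
# Experiment 2: several GPUs with P2P disabled.
multi = build_env([2, 3, 4, 5, 6], base_env={})
print(single)
print(multi)
```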

parkjh688 commented on August 20, 2024

I tried the first one already but it didn't work.

But I also tried the second solution, and it seems to have solved the issue because it moved on to the next error. However, when I run the same command NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=5,6 python launch.py --run_dataset FFHQ256 --nproc_per_node 2 --task-type gmpi --run-type train master_port 8378, the same error sometimes occurs.

parkjh688 commented on August 20, 2024

Oh, I'd like to show the full traceback, but the output ends at warnings.warn('resource_tracker: There appear to be %d.

The new error is raised by X = pyspng.load(f.read()) in dataset.py.
Maybe the image file is not a PNG, or there is a problem with the path.
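
One way to test the "not a PNG" guess is to check each file's first 8 bytes against the PNG signature. This is a minimal sketch: the function names are made up for illustration, and it only catches files with the wrong header, not PNGs that are corrupted further in.

```python
from pathlib import Path

# Every valid PNG file starts with these 8 bytes.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def is_png(data):
    """Return True if the byte string starts with the PNG signature."""
    return data[:8] == PNG_SIGNATURE


def find_non_png(root):
    """Return paths of *.png files under root whose header is not PNG."""
    bad = []
    for path in sorted(Path(root).rglob("*.png")):
        with open(path, "rb") as f:
            if not is_png(f.read(8)):
                bad.append(path)
    return bad
```

Calling find_non_png on the dataset folder returns the files to look at first. An empty list suggests the files themselves are fine and the problem lies elsewhere, for example in the path or in the pyspng package.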

I didn't know that. Thanks! I have 8 GPUs, indexed 0 to 7, so I'll try various GPU combinations.

parkjh688 commented on August 20, 2024

Thanks!
