
Comments (10)

gaopengcuhk commented on June 16, 2024

Can you explain why you chose the same learning rate for all GPU configurations? In my experience, keeping the learning rate unchanged while increasing the batch size is equivalent to lowering the learning rate.
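
A rough way to see this intuition (a toy sketch with hypothetical numbers; it assumes plain SGD with a mean-reduced loss, where each example contributes lr / batch_size to the update):

lr = 1e-4
for batch_size in (16, 32):
    # the per-example contribution to the parameter update shrinks as the batch grows
    print(batch_size, lr / batch_size)  # 16 -> 6.25e-06, 32 -> 3.125e-06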


fmassa commented on June 16, 2024

@gaopengcuhk

Do I need to double the learning rate and learning rate backbone when extending 8GPU to 16 GPU?

No, we have been using the same learning rate for all GPU configurations.

Regarding your issues with submitit, I suggest opening an issue in the submitit repo.


alcinos commented on June 16, 2024

Hi,
As written in the readme, the recommended way to do that is to use submitit. Once you've installed it with pip, simply run:
python run_with_submitit.py --timeout 3000 --coco_path /path/to/coco --nodes 2 --ngpus 8
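
For reference, run_with_submitit.py builds a Slurm job through submitit roughly along these lines (a simplified sketch, not the script's exact code; train_one_job is a hypothetical stand-in for DETR's trainer callable, and the parameter values mirror the command above):

import submitit

def train_one_job():
    # hypothetical stand-in for the callable that wraps main.py
    pass

executor = submitit.AutoExecutor(folder="experiments")
executor.update_parameters(
    nodes=2,           # --nodes 2
    gpus_per_node=8,   # --ngpus 8
    tasks_per_node=8,  # one task per GPU
    cpus_per_task=10,
    timeout_min=3000,  # --timeout 3000
)
job = executor.submit(train_one_job)
print(job.job_id)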

I believe I have answered your question, and as such I'm closing this issue.


gaopengcuhk commented on June 16, 2024

Hi

I have some problems with submitit on my cluster.

Can you directly provide a command line like this one?
srun --gres gpu:8 -c 32 python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --coco_path

Currently, the above command can train the model with 1 node and 8 GPUs on my cluster.


gaopengcuhk commented on June 16, 2024

As I said in the other issue, I run into an error with the following command:

python run_with_submitit.py --ngpus 1 --nodes 1 --timeout 360 --batch_size 2 --no_aux_loss --eval --resume ./saved_model/detr-r50-e632da11.pth --coco_path ../../dataset/

When I run it, I get the following error:

submitit INFO (2020-06-02 21:34:32,188) - Starting with JobEnvironment(job_id=587603, hostname=SH-IDC1-10-198-6-145, local_rank=0(1), node=0(1), global_rank=0(1))
submitit INFO (2020-06-02 21:34:32,188) - Loading pickle: experiments/587603/587603_submitted.pkl
Process group: 1 tasks, rank: 0
submitit ERROR (2020-06-02 21:35:20,838) - Submitted job triggered an exception

The system only returns "Submitted job triggered an exception" with no other feedback.


gaopengcuhk commented on June 16, 2024

I tried the following:
python run_with_submitit.py --timeout 3000 --coco_path ... --ngpus 8 --nodes 2

and got the following error:
submitit.core.utils.FailedJobError: sbatch: unrecognized option '--gpus-per-node=8'
sbatch: error: CPU count per node can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available
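
This failure pattern is consistent with an older Slurm installation that doesn't know the --gpus-per-node flag. One possible workaround (an untested sketch: skip gpus_per_node and request GPUs via --gres instead, using submitit's slurm_additional_parameters to inject a raw #SBATCH option):

import submitit

executor = submitit.AutoExecutor(folder="experiments")
executor.update_parameters(
    nodes=2,
    tasks_per_node=8,
    cpus_per_task=10,
    timeout_min=3000,
    # emits '#SBATCH --gres=gpu:8' instead of the newer --gpus-per-node flag
    slurm_additional_parameters={"gres": "gpu:8"},
)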


gaopengcuhk commented on June 16, 2024

I use the following approach for multi-node training.

srun --gres gpu:8 -c 32 python -m torch.distributed.launch --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr=".........." --master_port=1234 --use_env main.py --coco_path ........ --num_workers 6 --output_dir .....

srun --gres gpu:8 -c 32 python -m torch.distributed.launch --nproc_per_node=8 --nnodes=2 --node_rank=1 --master_addr=".........." --master_port=1234 --use_env main.py --coco_path ....... --num_workers 6 --output_dir ......
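
With --use_env, torch.distributed.launch exports RANK, WORLD_SIZE, and LOCAL_RANK to the environment, while MASTER_ADDR and MASTER_PORT come from the flags above. A minimal sketch of how the launched script can initialize from them (illustrative, not DETR's exact code):

import os
import torch
import torch.distributed as dist

def init_distributed():
    # init_method="env://" reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE
    dist.init_process_group(backend="nccl", init_method="env://")
    local_rank = int(os.environ["LOCAL_RANK"])  # set by launch when --use_env is passed
    torch.cuda.set_device(local_rank)           # pin this process to its own GPU
    return local_rank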


gaopengcuhk commented on June 16, 2024

Do I need to double the learning rate and the backbone learning rate when going from 8 GPUs to 16 GPUs?


gaopengcuhk commented on June 16, 2024

Answer from the author:

We didn't scale the learning rate for our experiments; we found out that with Adam it was OK to use the same default values for all configurations (even when using 64 GPUs).

The linear scaling rule is definitely too aggressive, and the model will probably not train at all with it. If you want to try a scaling rule for the learning rate, square-root scaling could potentially work (so when you double the batch size, multiply the learning rate by sqrt(2)).
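
In code, the suggested square-root rule amounts to the following (a small sketch; the base values assume DETR's defaults of lr 1e-4 at a total batch size of 16, i.e. 2 images x 8 GPUs):

import math

def sqrt_scaled_lr(base_lr, base_batch, new_batch):
    # square-root rule: doubling the batch multiplies the learning rate by sqrt(2)
    return base_lr * math.sqrt(new_batch / base_batch)

print(sqrt_scaled_lr(1e-4, 16, 32))  # ~1.414e-04 when moving from 8 to 16 GPUs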


sukjunhwang commented on June 16, 2024

Hi,
I've read this issue, and from the very last comment,

The linear scaling rule is definitely too aggressive, and the model will probably not train at all with it.

it seems that the learning rate is very sensitive for training even with the AdamW optimizer, and that it should be kept fixed even when the number of images per batch increases.
I am currently using 8 V100s, each with 16 GB, so I have to decrease the number of images per batch to at most 32.
Should I keep the learning rate the same even though the images per batch are fewer than in the default setting, or should I lower the learning rate?

