
Comments (10)

gaopengcuhk commented on June 16, 2024

Can you explain why you chose the same learning rate for all GPU configurations? In my experience, keeping the learning rate unchanged while increasing the batch size is equivalent to lowering the learning rate.
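
A rough way to see this intuition (a toy sketch with hypothetical numbers; it assumes plain SGD with a mean-reduced loss, where each example contributes lr / batch_size to the update):

lr = 1e-4
for batch_size in (16, 32):
    # the per-example contribution to the parameter update shrinks as the batch grows
    print(batch_size, lr / batch_size)  # 16 -> 6.25e-06, 32 -> 3.125e-06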


fmassa commented on June 16, 2024

@gaopengcuhk

Do I need to double the learning rate and learning rate backbone when extending 8GPU to 16 GPU?

No, we have been using the same learning rate for all GPU configurations.

Regarding your issues with submitit, I suggest opening an issue in the submitit repo.


alcinos commented on June 16, 2024

Hi,
As written in the readme, the recommended way to do that is to use submitit. Once you've installed it with pip, simply run:
python run_with_submitit.py --timeout 3000 --coco_path /path/to/coco --nodes 2 --ngpus 8
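
For reference, run_with_submitit.py builds a Slurm job through submitit roughly along these lines (a simplified sketch, not the script's exact code; train_one_job is a hypothetical stand-in for DETR's trainer callable, and the parameter values mirror the command above):

import submitit

def train_one_job():
    # hypothetical stand-in for the callable that wraps main.py
    pass

executor = submitit.AutoExecutor(folder="experiments")
executor.update_parameters(
    nodes=2,           # --nodes 2
    gpus_per_node=8,   # --ngpus 8
    tasks_per_node=8,  # one task per GPU
    cpus_per_task=10,
    timeout_min=3000,  # --timeout 3000
)
job = executor.submit(train_one_job)
print(job.job_id)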

I believe I have answered your question, and as such I'm closing this issue.


gaopengcuhk commented on June 16, 2024

Hi

I have some problems with submitit on my cluster.

Can you directly provide a command line like this one?
srun --gres gpu:8 -c 32 python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --coco_path

Currently, the above command can train the model with 1 node and 8 GPUs on my cluster.


gaopengcuhk commented on June 16, 2024

As I said in the other issue, I run into an error with the following command:

python run_with_submitit.py --ngpus 1 --nodes 1 --timeout 360 --batch_size 2 --no_aux_loss --eval --resume ./saved_model/detr-r50-e632da11.pth --coco_path ../../dataset/

When I run it, I get the following error:

submitit INFO (2020-06-02 21:34:32,188) - Starting with JobEnvironment(job_id=587603, hostname=SH-IDC1-10-198-6-145, local_rank=0(1), node=0(1), global_rank=0(1))
submitit INFO (2020-06-02 21:34:32,188) - Loading pickle: experiments/587603/587603_submitted.pkl
Process group: 1 tasks, rank: 0
submitit ERROR (2020-06-02 21:35:20,838) - Submitted job triggered an exception

The system only returns "Submitted job triggered an exception" with no other feedback.


gaopengcuhk commented on June 16, 2024

I tried the following:
python run_with_submitit.py --timeout 3000 --coco_path ... --ngpus 8 --nodes 2

and got the following error:
submitit.core.utils.FailedJobError: sbatch: unrecognized option '--gpus-per-node=8'
sbatch: error: CPU count per node can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available
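
This failure pattern is consistent with an older Slurm installation that doesn't know the --gpus-per-node flag. One possible workaround (an untested sketch: skip gpus_per_node and request GPUs via --gres instead, using submitit's slurm_additional_parameters to inject a raw #SBATCH option):

import submitit

executor = submitit.AutoExecutor(folder="experiments")
executor.update_parameters(
    nodes=2,
    tasks_per_node=8,
    cpus_per_task=10,
    timeout_min=3000,
    # emits '#SBATCH --gres=gpu:8' instead of the newer --gpus-per-node flag
    slurm_additional_parameters={"gres": "gpu:8"},
)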


gaopengcuhk commented on June 16, 2024

I use the following approach for multi-node training.

srun --gres gpu:8 -c 32 python -m torch.distributed.launch --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr=".........." --master_port=1234 --use_env main.py --coco_path ........ --num_workers 6 --output_dir .....

srun --gres gpu:8 -c 32 python -m torch.distributed.launch --nproc_per_node=8 --nnodes=2 --node_rank=1 --master_addr=".........." --master_port=1234 --use_env main.py --coco_path ....... --num_workers 6 --output_dir ......
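
With --use_env, torch.distributed.launch exports RANK, WORLD_SIZE, and LOCAL_RANK to the environment, while MASTER_ADDR and MASTER_PORT come from the flags above. A minimal sketch of how the launched script can initialize from them (illustrative, not DETR's exact code):

import os
import torch
import torch.distributed as dist

def init_distributed():
    # init_method="env://" reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE
    dist.init_process_group(backend="nccl", init_method="env://")
    local_rank = int(os.environ["LOCAL_RANK"])  # set by launch when --use_env is passed
    torch.cuda.set_device(local_rank)           # pin this process to its own GPU
    return local_rank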


gaopengcuhk commented on June 16, 2024

Do I need to double the learning rate and the backbone learning rate when going from 8 GPUs to 16 GPUs?


gaopengcuhk commented on June 16, 2024

Answer from the author:

We didn't scale the learning rate for our experiments; we found out that with Adam it was OK to use the same default values for all configurations (even when using 64 GPUs).

The linear scaling rule is definitely too aggressive, and the model will probably not train at all with it. If you want to try a scaling rule for the learning rate, square-root scaling could potentially work (so when you double the batch size, multiply the learning rate by sqrt(2)).
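
In code, the suggested square-root rule amounts to the following (a small sketch; the base values assume DETR's defaults of lr 1e-4 at a total batch size of 16, i.e. 2 images x 8 GPUs):

import math

def sqrt_scaled_lr(base_lr, base_batch, new_batch):
    # square-root rule: doubling the batch multiplies the learning rate by sqrt(2)
    return base_lr * math.sqrt(new_batch / base_batch)

print(sqrt_scaled_lr(1e-4, 16, 32))  # ~1.414e-04 when moving from 8 to 16 GPUs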


sukjunhwang commented on June 16, 2024

Hi,
I've read this issue, and from the very last comment,

The linear scaling rule is definitely too aggressive, and the model will probably not train at all with it.

it seems that the learning rate is very sensitive for training even with the AdamW optimizer, and that it should be kept fixed even when the number of images per batch increases.
I am currently using 8 V100s, each with 16 GB, so I have to decrease the number of images per batch to at most 32.
Should I keep the learning rate the same even though the images per batch are fewer than in the default setting, or should I lower the learning rate?

