Comments (10)
Can you explain why you chose the same learning rate for all GPU configurations? In my experience, keeping the learning rate unchanged while increasing the batch size is equivalent to lowering the learning rate.
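The intuition behind that equivalence can be checked with a toy SGD calculation (a sketch with made-up gradients, not DETR code): over the same pool of samples, one step on a large averaged batch moves the parameters less than many steps on small batches at the same lr.

```python
def sgd_update(lr, grads):
    # one SGD step on the mean gradient of a batch (returns the parameter delta)
    return -lr * sum(grads) / len(grads)

# hypothetical per-sample gradients, all equal for clarity
g = [1.0] * 64

# 8 steps with batch size 8 vs. 1 step with batch size 64, same lr
small = sum(sgd_update(1e-4, g[i:i + 8]) for i in range(0, 64, 8))
large = sgd_update(1e-4, g)

# over the same 64 samples, the single large-batch step is 8x smaller,
# i.e. a fixed lr with a bigger batch behaves like a lower effective lr
```

This is why scaling rules for the learning rate exist in the first place; whether any of them applies to DETR's AdamW setup is discussed further down the thread.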
from detr.
Do I need to double the learning rate and learning rate backbone when extending 8GPU to 16 GPU?
No, we have been using the same learning rates for all GPU configurations.
Regarding your issues with submitit, I suggest opening an issue in the submitit repo.
from detr.
Hi,
As written in the readme, the recommended way to do that is to use submitit. Once you've pip-installed it, simply run:
python run_with_submitit.py --timeout 3000 --coco_path /path/to/coco --nodes 2 --ngpus 8
I believe I have answered your question, and as such I'm closing this issue.
from detr.
Hi,
I have some problems with submitit on my cluster.
Can you directly provide a command line like this?
srun --gres gpu:8 -c 32 python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --coco_path
Currently, the above command can train the model with 1 node and 8 GPUs on my cluster.
from detr.
As I said in the other issue, when I run your code:
python run_with_submitit.py --ngpus 1 --nodes 1 --timeout 360 --batch_size 2 --no_aux_loss --eval --resume ./saved_model/detr-r50-e632da11.pth --coco_path ../../dataset/
I run into the following error:
submitit INFO (2020-06-02 21:34:32,188) - Starting with JobEnvironment(job_id=587603, hostname=SH-IDC1-10-198-6-145, local_rank=0(1), node=0(1), global_rank=0(1))
submitit INFO (2020-06-02 21:34:32,188) - Loading pickle: experiments/587603/587603_submitted.pkl
Process group: 1 tasks, rank: 0
submitit ERROR (2020-06-02 21:35:20,838) - Submitted job triggered an exception
The system only returns "Submitted job triggered an exception" with no other feedback.
from detr.
I tried the following:
python run_with_submitit.py --timeout 3000 --coco_path ... --ngpus 8 --nodes 2
and get the following error:
submitit.core.utils.FailedJobError: sbatch: unrecognized option '--gpus-per-node=8'
sbatch: error: CPU count per node can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available
from detr.
I use the following commands for multi-node training.
srun --gres gpu:8 -c 32 python -m torch.distributed.launch --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr=".........." --master_port=1234 --use_env main.py --coco_path ........ --num_workers 6 --output_dir .....
srun --gres gpu:8 -c 32 python -m torch.distributed.launch --nproc_per_node=8 --nnodes=2 --node_rank=1 --master_addr=".........." --master_port=1234 --use_env main.py --coco_path ....... --num_workers 6 --output_dir ......
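The two commands above differ only in `--node_rank`, because `torch.distributed.launch --use_env` turns (node rank, local rank) into a global rank via environment variables that `main.py` then reads for `init_process_group(init_method="env://")`. A minimal sketch of that mapping (hypothetical helper and values, not the launcher's actual code):

```python
def launch_env(node_rank, local_rank, nnodes, nproc_per_node,
               master_addr, master_port):
    # Env vars torch.distributed.launch --use_env exports for each worker process
    return {
        "RANK": str(node_rank * nproc_per_node + local_rank),  # global rank
        "WORLD_SIZE": str(nnodes * nproc_per_node),            # total workers
        "LOCAL_RANK": str(local_rank),                         # GPU index on this node
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
    }

# e.g. worker 3 on the second node (node_rank=1) of a 2-node x 8-GPU job:
env = launch_env(1, 3, 2, 8, "10.0.0.1", 1234)
# env["RANK"] == "11", env["WORLD_SIZE"] == "16"
```

So as long as both srun commands agree on `--master_addr`, `--master_port`, and `--nnodes`, and each node uses a distinct `--node_rank`, the 16 workers rendezvous into one process group.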
from detr.
Do I need to double the learning rate and learning rate backbone when extending 8GPU to 16 GPU?
from detr.
We didn't scale the learning rate for our experiments; we found that with Adam it was fine to use the same default values for all configurations (even when using 64 GPUs).
The linear scaling rule is definitely too aggressive, and the model will probably not train at all with it. If you want to try a scaling rule for the learning rate, square-root scaling could potentially work (i.e., if you double the batch size, multiply the learning rate by sqrt(2)).
Answer from the author
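The square-root rule the author describes is just arithmetic; a small sketch (the baseline lr of 1e-4 at a total batch of 16, i.e. 2 per GPU on 8 GPUs, is taken from the README defaults and worth double-checking against main.py):

```python
import math

def scaled_lr(base_lr, base_batch, new_batch):
    # square-root scaling: multiply the lr by sqrt(batch-size ratio)
    return base_lr * math.sqrt(new_batch / base_batch)

# going from 8 GPUs (total batch 16) to 16 GPUs (total batch 32)
lr_16gpu = scaled_lr(1e-4, 16, 32)  # multiplies the lr by sqrt(2)
```

Contrast with the linear rule, which would double the lr outright for the same batch-size change; per the author's comment, that is too aggressive here.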
from detr.
Hi,
I've read this issue, and from the very last comment,
The linear scaling rule is definitely too aggressive, and the model will probably not train at all with it.
it seems that training is very sensitive to the learning rate even with the AdamW optimizer, and that the learning rate has to stay fixed even if the number of images per batch increases.
I am currently using 8 V100s, each with 16 GB, so I have to decrease the number of images per batch to at most 32.
Should I keep the learning rate the same even though the images per batch are fewer than in the default setting, or should I lower the lr?
from detr.
Related Issues (20)
- Question about object queries. HOT 4
- I want to train the DETR model on a CPU. How can I make it possible on a small computer, 8gb RAM HOT 3
- Why positional encoding is added to different role in encoder and decoder. HOT 1
- 🐛 Bug: Architecture diagram in README.md renders incorrectly when using dark mode
- continue training with chekckpoint
- How to finetune DETR for semantic segmentation task?
- I do not understand what the mask meaning in "samlpes"
- Process finished with exit code 137 (interrupted by signal 9: SIGKILL)Please read & provide the following
- Very low performance for segmentation task.
- box_cxcywh_to_xyxy
- ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 6 (pid: 257736) of binary: /home/public/anaconda3/envs/DL/bin/python
- Average Precision of each class for best epoch and then it's mean HOT 1
- the mAP is chage
- I think there are some errors in the posted code HOT 6
- Queries for images with low number of objects HOT 2
- RuntimeError: Error(s) in loading state_dict for DETRsegm: HOT 2
- Map metrics anomalies after backbone replacement
- when the trained model is used for inference this import error comes: RuntimeError: Failed to import transformers.models.detr.modeling_detr because of the following error (look up to see its traceback): cannot import name 'experimental_functions_run_eagerly' from 'tensorflow.python.eager.def_function' (C:\Anaconda\lib\site-packages\tensorflow\python\eager\def_function.py)
- Get Image masks coordinates.
- GFLOPs instead of GFLOPS?