Comments (19)
You can specify GPU ids with CUDA_VISIBLE_DEVICES. For example, CUDA_VISIBLE_DEVICES=4,5,6,7 pods_train --num-gpus 4 will use the last 4 GPUs for training. You may need to adjust the warmup iterations and warmup factor when you use fewer GPUs for training.
from yolof.
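A note on how the environment variable takes effect: the CUDA runtime reads CUDA_VISIBLE_DEVICES once, when it initializes, so setting it from inside a script only works if the assignment runs before any CUDA-touching import. A minimal sketch of the ordering (standard library only; the torch import is illustrative):

```python
import os

# CUDA_VISIBLE_DEVICES must be set before the CUDA runtime is first
# initialized, so place it at the very top of train_net.py, before
# importing torch or any module that imports torch.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # no space after the comma:
# a malformed entry can make CUDA ignore the ids that follow it

# import torch  # CUDA-using imports come only after the line above

# The two visible cards are re-indexed as cuda:0 and cuda:1 in-process.
visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
print(len(visible))  # 2
```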
I added the statement os.environ['CUDA_VISIBLE_DEVICES'] = '0, 1' in the train_net script. When running the train_net script for training, it reports an error:
Default process group is not initialized
How can I solve it?
Also, the default batch_size is 4. I am training on two 3090s with 24 GB of memory each; how do I modify the batch size?
from yolof.
Oh, I see. I need to modify the IMS_PER_BATCH and IMS_PER_DEVICE parameters in the config script to change the batch size.
But for training on two 3090 cards, what should I change the WARMUP_FACTOR and WARMUP_ITERS parameters to?
from yolof.
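For reference, the arithmetic relation between the two config keys (the per-device value of 8 is an assumption for illustration, not taken from the YOLOF configs):

```python
# IMS_PER_BATCH is the total batch size summed over all devices, and
# IMS_PER_DEVICE is each GPU's share, so for a world of N GPUs:
#     IMS_PER_BATCH == IMS_PER_DEVICE * N
ims_per_device = 8       # assumed per-GPU value, not taken from the repo
world_size = 8           # the config's default world (8 GPUs)
ims_per_batch = ims_per_device * world_size
print(ims_per_batch)     # 64
```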
When you use two GPUs, the error Default process group is not initialized should not show up.
For changing the WARMUP_FACTOR and WARMUP_ITERS:
WARMUP_ITERS = 1500 * 8 / NUM_GPUS
WARMUP_FACTOR = 1. / WARMUP_ITERS
from yolof.
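The formula above, evaluated in plain Python for the two-GPU case (the function and argument names are illustrative; only the formula itself comes from the reply):

```python
# WARMUP_ITERS/WARMUP_FACTOR scaling from the reply above: the 8-GPU
# schedule uses 1500 warmup iterations, and fewer GPUs (a smaller total
# batch) stretch the warmup proportionally.
def warmup_for(num_gpus, base_iters=1500, base_world_size=8):
    warmup_iters = base_iters * base_world_size // num_gpus
    warmup_factor = 1.0 / warmup_iters
    return warmup_iters, warmup_factor

iters, factor = warmup_for(2)
print(iters)   # 6000
print(factor)  # 1/6000, roughly 0.000167
```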
I have now modified the corresponding parameters in the config script, but running the train_net script still reports an error:
Default process group is not initialized
from yolof.
Traceback (most recent call last):
  File "train_net.py", line 106, in <module>
    launch(
  File "/media/data/huzhen/YOLOF-torch/cvpods/engine/launch.py", line 56, in launch
    main_func(*args)
  File "train_net.py", line 96, in main
    runner.train()
  File "/media/data/huzhen/YOLOF-torch/cvpods/engine/runner.py", line 270, in train
    super().train(self.start_iter, self.start_epoch, self.max_iter)
  File "/media/data/huzhen/YOLOF-torch/cvpods/engine/base_runner.py", line 84, in train
    self.run_step()
  File "/media/data/huzhen/YOLOF-torch/cvpods/engine/base_runner.py", line 185, in run_step
    loss_dict = self.model(data)
  File "/home/hz/anaconda3/envs/torch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/data/huzhen/YOLOF-torch/playground/detection/coco/yolof/yolof_base/yolof.py", line 133, in forward
    losses = self.losses(
  File "/media/data/huzhen/YOLOF-torch/playground/detection/coco/yolof/yolof_base/yolof.py", line 211, in losses
    dist.all_reduce(num_foreground)
  File "/home/hz/anaconda3/envs/torch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 953, in all_reduce
    _check_default_pg()
  File "/home/hz/anaconda3/envs/torch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 210, in _check_default_pg
    assert _default_pg is not None,
AssertionError: Default process group is not initialized
from yolof.
Could you provide more details about your command for training?
from yolof.
I am using the train_net script under the tools folder for training. Some parameters in the config script are adjusted, including IMS_PER_BATCH, IMS_PER_DEVICE, WARMUP_FACTOR and WARMUP_ITERS. I also added an extra statement in the train_net script: os.environ['CUDA_VISIBLE_DEVICES'] = '0, 1'.
And I updated the dataset path in the base_dataset script.
The other default parameters and hyper-parameters are unchanged.
from yolof.
You need to add --num-gpus to your command when you train with YOLOF.
BTW, we recommend using pods_train as given in the README.
from yolof.
Now there is a new error in the 'dist URL' parameter:
cvpods.engine.launch ERROR: Process group URL: tcp://127.0.0.1:50147
RuntimeError: Address already in use
Sigh... your code is really too hard to run...
from yolof.
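"Address already in use" means the TCP port in the process-group URL (tcp://127.0.0.1:50147 above) is already bound, often by workers left over from a previous crashed run; killing those stale processes usually clears it. As an illustration of the mechanics, the OS can hand out a port it knows is free (whether cvpods' launcher accepts a custom dist URL is an assumption to check against its source):

```python
import socket

# Ask the OS for a free TCP port on the loopback interface, the same
# kind of endpoint as the failing tcp://127.0.0.1:50147 URL above.
def find_free_port():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))  # port 0 -> OS picks an unused port
        return s.getsockname()[1]

port = find_free_port()
print(f"tcp://127.0.0.1:{port}")
```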
Why not just follow the steps in the README? It should work well.
from yolof.
Training with the method in the README can only change the number of GPUs; it cannot select which GPU ids to train on at all.
from yolof.
It can... I gave an example above:
You can specify GPU ids with CUDA_VISIBLE_DEVICES. For example, CUDA_VISIBLE_DEVICES=4,5,6,7 pods_train --num-gpus 4 will use the last 4 GPUs for training. You may need to adjust the warmup iterations and warmup factor when you use fewer GPUs for training.
from yolof.
OK, I know. Training with 2 GPUs still reports an error:
assert base_world_size == 8, "IMS_PER_BATCH/DEVICE in config file is used for 8 GPUs"
AssertionError: IMS_PER_BATCH/DEVICE in config file is used for 8 GPUs
The number of GPUs required by your code is too large. My team only has 4 GPUs per machine, so I don't think I can train it... sigh...
from yolof.
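The failing check can be reproduced in isolation. Its message suggests that cvpods treats the config's IMS_PER_BATCH/IMS_PER_DEVICE as 8-GPU reference values and rescales them for the actual GPU count itself, so rewriting them for 2 GPUs trips the assertion; this reading is an inference from the error text, not verified against the cvpods source:

```python
# Reproduce the failing check: the total/per-device ratio in the config
# must still describe an 8-GPU setup (the message's "base_world_size").
def check_config(ims_per_batch, ims_per_device):
    base_world_size = ims_per_batch // ims_per_device
    assert base_world_size == 8, \
        "IMS_PER_BATCH/DEVICE in config file is used for 8 GPUs"
    return base_world_size

# Values rewritten for 2 GPUs (e.g. total 16, 8 per device) trip it:
try:
    check_config(16, 8)
except AssertionError as e:
    print("fails:", e)

# The original 8-GPU values (assumed: total 64, 8 per device) pass:
print("passes with base_world_size =", check_config(64, 8))
```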
I am using 4 GPUs for training in the way you provided, like this:
CUDA_VISIBLE_DEVICES=0,1,2,3 pods_train --num-gpus 4
But it still reports an error:
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607370156314/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8
How can I solve it? Thanks!
from yolof.
Many reasons can produce this error. You can refer to this solution and have a try.
from yolof.
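When narrowing down NCCL failures like the one above, a common first step is to rerun with NCCL's own diagnostic logging enabled; these are standard NCCL environment variables, not anything cvpods-specific:

```python
import os

# Turn on NCCL's diagnostic logging before the launcher spawns workers;
# the resulting log usually names the failing collective and transport.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "ALL"  # more verbose per-subsystem logs

# Equivalent shell form: NCCL_DEBUG=INFO pods_train --num-gpus 4
print(os.environ["NCCL_DEBUG"])  # INFO
```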
OK, I'll try to see if I can work it out. Thanks!
from yolof.
This code is too hard to run.
from yolof.
This code is too hard to run.
Yes, it is very hard to run. It is implemented on top of the cvpods library, so you have to install and compile that library, and then compile again inside the source tree. It also needs at least four cards to run, which is very demanding on GPUs. I tried it earlier with four 2080 Tis and it still reported an error, the one above. Hard to pin down; I don't want to train this code any more. Actually, the encoder part of this paper is worth studying, but I can't be bothered to spend time on the rest. I still have to run my own experiments, sigh...
from yolof.