how to train and test? (fast-bev, 13 comments, closed)

sense-gvt commented on August 30, 2024
how to train and test?

Comments (13)

hly2990 commented on August 30, 2024

Thank you for your answer. But while running PyTorch code on multiple GPUs in a single server, my program didn't respond for a long time with the setting os.environ['WORLD_SIZE'] = '2'. So could you provide a code example for running on one server rather than a cluster? Thanks a lot!
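
(For context, a minimal single-node launch sketch; this is an illustration, not the repo's documented command. mmcv's "pytorch" launcher expects RANK/WORLD_SIZE to be populated by the launcher rather than set by hand:)

```bash
# a minimal single-node sketch (not from the repo): let torch.distributed.launch
# populate RANK/WORLD_SIZE/MASTER_* instead of hand-setting WORLD_SIZE
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 \
    tools/train.py configs/fastbev/exp/paper/fastbev_m0_r18_s256x704_v200x200x4_c192_d2_f4.py \
    --launcher pytorch
```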

ymlab commented on August 30, 2024

This script can help:
https://github.com/Sense-GVT/Fast-BEV/blob/dev/tools/fastbev_run.sh

hly2990 commented on August 30, 2024

Could you provide an example of starting training?

Rex-LK commented on August 30, 2024

I tried two ways of training.

1. sh tools/fastbev_run.sh (slurm_train $PARTITION 1 fastbev/exp/paper/fastbev_m0_r18_s256x704_v200x200x4_c192_d2_f4)

MMDET3D: /data/rex/BEV/Fast-BEV-dev
SRUN_ARGS: -s
basename: missing operand
Try 'basename --help' for more information.
tools/fastbev_run.sh: 16: arithmetic expression: division by zero: "fastbev/exp/paper/fastbev_m0_r18_s256x704_v200x200x4_c192_d2_f4<8?fastbev/exp/paper/fastbev_m0_r18_s256x704_v200x200x4_c192_d2_f4:8"

2. python tools/train.py configs/fastbev/exp/paper/fastbev_m0_r18_s256x704_v200x200x4_c192_d2_f4.py

Error:
File "/home/snk/anaconda3/envs/fastbev/lib/python3.8/site-packages/nuscenes/nuscenes.py", line 225, in getind
    return self._token2ind[table_name][token]
KeyError: '442c7729b9d0455ca75978f1a7fdab3a'
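
(This KeyError usually means the annotation .pkl files reference a token that is missing from the nuScenes tables on disk, e.g. info files built against a different split or devkit version than the dataset being loaded. A quick diagnostic sketch; the version, dataroot, and use of the private _token2ind attribute, which appears in the traceback above, are assumptions:)

```python
# a diagnostic sketch: check whether the failing token exists in any nuScenes table.
# version/dataroot are assumptions; _token2ind is a private devkit attribute, so
# this may break with other devkit versions.
from nuscenes.nuscenes import NuScenes

nusc = NuScenes(version='v1.0-trainval', dataroot='./data/nuscenes', verbose=False)
token = '442c7729b9d0455ca75978f1a7fdab3a'
hits = [name for name, ind in nusc._token2ind.items() if token in ind]
print(hits or 'token not found: the info .pkl files likely mismatch the dataset on disk')
```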

ymlab commented on August 30, 2024

The training script provided by this repo is based on slurm. You can see that two parameters need to be passed to the training script, namely PARTITION and QUOTATYPE (https://github.com/Sense-GVT/Fast-BEV/blob/dev/tools/fastbev_run.sh#L93), which specify the training partition and the resource allocation type, respectively. Maybe at some point I can provide a dist-based training script (https://github.com/open-mmlab/mmdetection/tree/master/tools).

As an example, my usual command to start training is:
sh ./tools/fastbev_run.sh Test reserved

ymlab commented on August 30, 2024

> I tried two ways of training. … KeyError: '442c7729b9d0455ca75978f1a7fdab3a'

I am not sure about the specific reason for the second error.

Ignite616 commented on August 30, 2024

I am also running on multiple GPUs in a single server. When I run python tools/train.py configs/fastbev/exp/paper/fastbev_m1_r18_s320x880_v200x200x4_c192_d2_f4.py --launcher pytorch, I get an error.
Can you guide me on how to set the rank, or on how you do single-machine multi-card training?

File "/home/shunan_zhl/anaconda3/envs/huofeng1/lib/python3.8/site-packages/mmcv/runner/dist_utils.py", line 18, in init_dist
    _init_dist_pytorch(backend, **kwargs)
File "/home/shunan_zhl/anaconda3/envs/huofeng1/lib/python3.8/site-packages/mmcv/runner/dist_utils.py", line 29, in _init_dist_pytorch
    rank = int(os.environ['RANK'])
File "/home/shunan_zhl/anaconda3/envs/huofeng1/lib/python3.8/os.py", line 675, in __getitem__
    raise KeyError(key) from None
KeyError: 'RANK'

ymlab commented on August 30, 2024

> I am also running on multiple GPUs in a single server. … KeyError: 'RANK'

This might help: #18

hly2990 commented on August 30, 2024

> I am also running on multiple GPUs in a single server. … KeyError: 'RANK'
>
> This might help: #18

I can run successfully with the command:
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 tools/train.py ./configs/fastbev/exp/paper/fastbev_m0_r18_s256x704_v200x200x4_c192_d2_f4.py --work-dir=./work_dirs/my/exp/ --launcher="pytorch" --gpus 4
Maybe you can give it a try.
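
(A roughly equivalent form with the newer launcher, assuming torch >= 1.10; torchrun exports RANK/LOCAL_RANK/WORLD_SIZE, which is what mmcv's "pytorch" launcher reads:)

```bash
# a sketch with the newer launcher (assumes torch >= 1.10); not from the repo docs
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 \
    tools/train.py ./configs/fastbev/exp/paper/fastbev_m0_r18_s256x704_v200x200x4_c192_d2_f4.py \
    --work-dir=./work_dirs/my/exp/ --launcher pytorch
```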

Ignite616 commented on August 30, 2024

> I am also running on multiple GPUs in a single server. … KeyError: 'RANK'
>
> This might help: #18

First of all, thank you for your reply. Following #18, I use the 4th and 5th GPU cards for training. Two GPUs are allocated, but computation only happens on the 5th GPU. How can I get both GPUs to compute?
[screenshot: GPU utilization]
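
(One likely cause, stated as an assumption: with CUDA_VISIBLE_DEVICES=4,5 the two physical cards appear inside each process as cuda:0 and cuda:1, so code that pins everything to one hard-coded device will put both ranks on a single card. A sketch of the expected per-rank pinning, assuming a launcher that exports LOCAL_RANK, such as torchrun or a recent torch.distributed.launch:)

```python
# a sketch of per-rank device pinning under CUDA_VISIBLE_DEVICES=4,5; not repo code.
# After remapping, local rank 0 runs on physical GPU 4 and local rank 1 on GPU 5.
import os
import torch

local_rank = int(os.environ['LOCAL_RANK'])  # exported by torchrun / recent launchers
torch.cuda.set_device(local_rank)
print(f'local rank {local_rank} -> {torch.cuda.get_device_name(local_rank)}')
```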

Ignite616 commented on August 30, 2024

I also hit an error in testing.
I have successfully used the m1 model for visualization on a single GPU. To see the best results of the Fast-BEV model, I intend to use the m5 model, but a single GPU does not have enough memory for its visualization.
So, following mmdetection/tools/dist_test.sh, I use two GPUs for visualization.
My dist_test.bash is as follows:
[screenshot: dist_test.bash contents]
I run it with bash tools/dist_test.sh configs/fastbev/exp/paper/fastbev_m5_r50_s512x1408_v250x250x6_c256_d6_f4.py download_model/m5_epoch_20.pth 2.
Then I get an error, as follows:
[screenshot: error message]

How can I solve this? (The 24 is 4 × 6, i.e. 4 frames × 6 cameras.)
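
(For reference, a minimal dist_test.sh in the mmdetection style; the exact contents here are an assumption, since the poster's script is only visible in the screenshot above:)

```bash
#!/usr/bin/env bash
# a sketch in the mmdetection dist_test.sh style; not the poster's exact script
CONFIG=$1
CHECKPOINT=$2
GPUS=$3
PORT=${PORT:-29500}

python -m torch.distributed.launch --nproc_per_node=$GPUS --master_port=$PORT \
    tools/test.py $CONFIG $CHECKPOINT --launcher pytorch ${@:4}
```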

Ignite616 commented on August 30, 2024

> I am also running on multiple GPUs in a single server. … KeyError: 'RANK'
>
> I can run successfully with the command CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 tools/train.py … Maybe you can give it a try.

Thank you for the pointer. I ran CUDA_VISIBLE_DEVICES=4,5 python -m torch.distributed.launch --nproc_per_node=2 tools/train.py configs/fastbev/exp/paper/fastbev_m5_r50_s512x1408_v250x250x6_c256_d6_f4.py --launcher="pytorch" --gpus 2, but I received an error:

```
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
==================================================
              tools/train.py FAILED
==================================================
Root Cause:
[0]:
  time: 2023-03-05_20:42:07
  rank: 1 (local_rank: 1)
  exitcode: -11 (pid: 43458)
  error_file: <N/A>
  msg: "Signal 11 (SIGSEGV) received by PID 43458"
==================================================
Other Failures:
  <NO_OTHER_FAILURES>
**************************************************
```

Is this because I didn't start with GPU 0? How should I use the fourth and fifth GPUs? I don't have permission to use GPU 0.
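
(A note on this, offered as an assumption rather than a confirmed diagnosis: CUDA_VISIBLE_DEVICES=4,5 by itself should be fine, since the two cards are renumbered to cuda:0 and cuda:1 inside each process. A SIGSEGV at startup more often points to an environment problem, e.g. a mismatched mmcv/CUDA build, or to two jobs colliding on the default rendezvous port of a shared server. A sketch that rules out the port clash:)

```bash
# a sketch, not a confirmed fix: pin an explicit, unused rendezvous port so two
# jobs on the same shared server do not collide on the default master port
CUDA_VISIBLE_DEVICES=4,5 python -m torch.distributed.launch --nproc_per_node=2 \
    --master_port=29511 \
    tools/train.py configs/fastbev/exp/paper/fastbev_m5_r50_s512x1408_v250x250x6_c256_d6_f4.py \
    --launcher pytorch
```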

sinsin1998 commented on August 30, 2024

> I am also running on multiple GPUs in a single server. … KeyError: 'RANK'
>
> I can run successfully with the command CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 tools/train.py … Maybe you can give it a try.

Hello, I tried to run this code on a 3090, but it is really slow, and I think there may be something wrong with the dataloader. Could you tell me your device and how long it takes to train this model for 20 epochs?
