how to train and test? (fast-bev, 13 comments, closed)

sense-gvt commented on August 30, 2024
how to train and test?

Comments (13)

hly2990 commented on August 30, 2024

Thank you for your answer. But while running PyTorch code on multiple GPUs in a single server, my program didn't respond for a long time with the setting os.environ['WORLD_SIZE'] = '2'. So could you provide a code example for running on one server rather than a cluster? Thanks a lot!
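
(For context, a minimal single-node launch sketch; this is an illustration, not the repo's documented command. mmcv's "pytorch" launcher expects RANK/WORLD_SIZE to be populated by the launcher rather than set by hand:)

```bash
# a minimal single-node sketch (not from the repo): let torch.distributed.launch
# populate RANK/WORLD_SIZE/MASTER_* instead of hand-setting WORLD_SIZE
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 \
    tools/train.py configs/fastbev/exp/paper/fastbev_m0_r18_s256x704_v200x200x4_c192_d2_f4.py \
    --launcher pytorch
```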

ymlab commented on August 30, 2024

This script can help:
https://github.com/Sense-GVT/Fast-BEV/blob/dev/tools/fastbev_run.sh

hly2990 commented on August 30, 2024

Could you provide an example of starting training?

Rex-LK commented on August 30, 2024

I tried two ways of training.

1. sh tools/fastbev_run.sh (slurm_train $PARTITION 1 fastbev/exp/paper/fastbev_m0_r18_s256x704_v200x200x4_c192_d2_f4)

MMDET3D: /data/rex/BEV/Fast-BEV-dev
SRUN_ARGS: -s
basename: missing operand
Try 'basename --help' for more information.
tools/fastbev_run.sh: 16: arithmetic expression: division by zero: "fastbev/exp/paper/fastbev_m0_r18_s256x704_v200x200x4_c192_d2_f4<8?fastbev/exp/paper/fastbev_m0_r18_s256x704_v200x200x4_c192_d2_f4:8"

2. python tools/train.py configs/fastbev/exp/paper/fastbev_m0_r18_s256x704_v200x200x4_c192_d2_f4.py

Error:
File "/home/snk/anaconda3/envs/fastbev/lib/python3.8/site-packages/nuscenes/nuscenes.py", line 225, in getind
    return self._token2ind[table_name][token]
KeyError: '442c7729b9d0455ca75978f1a7fdab3a'
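
(This KeyError usually means the annotation .pkl files reference a token that is missing from the nuScenes tables on disk, e.g. info files built against a different split or devkit version than the dataset being loaded. A quick diagnostic sketch; the version, dataroot, and use of the private _token2ind attribute, which appears in the traceback above, are assumptions:)

```python
# a diagnostic sketch: check whether the failing token exists in any nuScenes table.
# version/dataroot are assumptions; _token2ind is a private devkit attribute, so
# this may break with other devkit versions.
from nuscenes.nuscenes import NuScenes

nusc = NuScenes(version='v1.0-trainval', dataroot='./data/nuscenes', verbose=False)
token = '442c7729b9d0455ca75978f1a7fdab3a'
hits = [name for name, ind in nusc._token2ind.items() if token in ind]
print(hits or 'token not found: the info .pkl files likely mismatch the dataset on disk')
```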

ymlab commented on August 30, 2024

The training script provided by this repo is based on slurm. You can see that two parameters need to be passed to the training script, namely PARTITION and QUOTATYPE (https://github.com/Sense-GVT/Fast-BEV/blob/dev/tools/fastbev_run.sh#L93), which specify the training partition and the resource allocation type, respectively. Maybe at some point I can provide a dist-based training script (https://github.com/open-mmlab/mmdetection/tree/master/tools).

As an example, my usual command to start training is:
sh ./tools/fastbev_run.sh Test reserved

ymlab commented on August 30, 2024

> I tried two ways of training. … KeyError: '442c7729b9d0455ca75978f1a7fdab3a'

I am not sure about the specific reason for the second error.

Ignite616 commented on August 30, 2024

I am also running on multiple GPUs in a single server. When I run python tools/train.py configs/fastbev/exp/paper/fastbev_m1_r18_s320x880_v200x200x4_c192_d2_f4.py --launcher pytorch, I get an error.
Can you guide me on how to set the rank, or on how you do single-machine multi-card training?

File "/home/shunan_zhl/anaconda3/envs/huofeng1/lib/python3.8/site-packages/mmcv/runner/dist_utils.py", line 18, in init_dist
    _init_dist_pytorch(backend, **kwargs)
File "/home/shunan_zhl/anaconda3/envs/huofeng1/lib/python3.8/site-packages/mmcv/runner/dist_utils.py", line 29, in _init_dist_pytorch
    rank = int(os.environ['RANK'])
File "/home/shunan_zhl/anaconda3/envs/huofeng1/lib/python3.8/os.py", line 675, in __getitem__
    raise KeyError(key) from None
KeyError: 'RANK'

ymlab commented on August 30, 2024

> I am also running on multiple GPUs in a single server. … KeyError: 'RANK'

This might help: #18

hly2990 commented on August 30, 2024

> I am also running on multiple GPUs in a single server. … KeyError: 'RANK'
>
> This might help: #18

I can run successfully with the command:
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 tools/train.py ./configs/fastbev/exp/paper/fastbev_m0_r18_s256x704_v200x200x4_c192_d2_f4.py --work-dir=./work_dirs/my/exp/ --launcher="pytorch" --gpus 4
Maybe you can give it a try.
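
(A roughly equivalent form with the newer launcher, assuming torch >= 1.10; torchrun exports RANK/LOCAL_RANK/WORLD_SIZE, which is what mmcv's "pytorch" launcher reads:)

```bash
# a sketch with the newer launcher (assumes torch >= 1.10); not from the repo docs
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 \
    tools/train.py ./configs/fastbev/exp/paper/fastbev_m0_r18_s256x704_v200x200x4_c192_d2_f4.py \
    --work-dir=./work_dirs/my/exp/ --launcher pytorch
```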

Ignite616 commented on August 30, 2024

> I am also running on multiple GPUs in a single server. … KeyError: 'RANK'
>
> This might help: #18

First of all, thank you for your reply. Following #18, I use the 4th and 5th GPU cards for training. Two GPUs are allocated, but computation only happens on the 5th GPU. How can I get both GPUs to compute?
[screenshot: GPU utilization]
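
(One likely cause, stated as an assumption: with CUDA_VISIBLE_DEVICES=4,5 the two physical cards appear inside each process as cuda:0 and cuda:1, so code that pins everything to one hard-coded device will put both ranks on a single card. A sketch of the expected per-rank pinning, assuming a launcher that exports LOCAL_RANK, such as torchrun or a recent torch.distributed.launch:)

```python
# a sketch of per-rank device pinning under CUDA_VISIBLE_DEVICES=4,5; not repo code.
# After remapping, local rank 0 runs on physical GPU 4 and local rank 1 on GPU 5.
import os
import torch

local_rank = int(os.environ['LOCAL_RANK'])  # exported by torchrun / recent launchers
torch.cuda.set_device(local_rank)
print(f'local rank {local_rank} -> {torch.cuda.get_device_name(local_rank)}')
```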

Ignite616 commented on August 30, 2024

I also hit an error in testing.
I have successfully used the m1 model for visualization on a single GPU. To see the best results of the Fast-BEV model, I intend to use the m5 model, but a single GPU does not have enough memory for its visualization.
So, following mmdetection/tools/dist_test.sh, I use two GPUs for visualization.
My dist_test.bash is as follows:
[screenshot: dist_test.bash contents]
I run it with bash tools/dist_test.sh configs/fastbev/exp/paper/fastbev_m5_r50_s512x1408_v250x250x6_c256_d6_f4.py download_model/m5_epoch_20.pth 2.
Then I get an error, as follows:
[screenshot: error message]

How can I solve this? (The 24 is 4 × 6, i.e. 4 frames × 6 cameras.)
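
(For reference, a minimal dist_test.sh in the mmdetection style; the exact contents here are an assumption, since the poster's script is only visible in the screenshot above:)

```bash
#!/usr/bin/env bash
# a sketch in the mmdetection dist_test.sh style; not the poster's exact script
CONFIG=$1
CHECKPOINT=$2
GPUS=$3
PORT=${PORT:-29500}

python -m torch.distributed.launch --nproc_per_node=$GPUS --master_port=$PORT \
    tools/test.py $CONFIG $CHECKPOINT --launcher pytorch ${@:4}
```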

Ignite616 commented on August 30, 2024

> I am also running on multiple GPUs in a single server. … KeyError: 'RANK'
>
> I can run successfully with the command CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 tools/train.py … Maybe you can give it a try.

Thank you for the pointer. I ran CUDA_VISIBLE_DEVICES=4,5 python -m torch.distributed.launch --nproc_per_node=2 tools/train.py configs/fastbev/exp/paper/fastbev_m5_r50_s512x1408_v250x250x6_c256_d6_f4.py --launcher="pytorch" --gpus 2, but I received an error:

```
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
==================================================
              tools/train.py FAILED
==================================================
Root Cause:
[0]:
  time: 2023-03-05_20:42:07
  rank: 1 (local_rank: 1)
  exitcode: -11 (pid: 43458)
  error_file: <N/A>
  msg: "Signal 11 (SIGSEGV) received by PID 43458"
==================================================
Other Failures:
  <NO_OTHER_FAILURES>
**************************************************
```

Is this because I didn't start with GPU 0? How should I use the fourth and fifth GPUs? I don't have permission to use GPU 0.
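
(A note on this, offered as an assumption rather than a confirmed diagnosis: CUDA_VISIBLE_DEVICES=4,5 by itself should be fine, since the two cards are renumbered to cuda:0 and cuda:1 inside each process. A SIGSEGV at startup more often points to an environment problem, e.g. a mismatched mmcv/CUDA build, or to two jobs colliding on the default rendezvous port of a shared server. A sketch that rules out the port clash:)

```bash
# a sketch, not a confirmed fix: pin an explicit, unused rendezvous port so two
# jobs on the same shared server do not collide on the default master port
CUDA_VISIBLE_DEVICES=4,5 python -m torch.distributed.launch --nproc_per_node=2 \
    --master_port=29511 \
    tools/train.py configs/fastbev/exp/paper/fastbev_m5_r50_s512x1408_v250x250x6_c256_d6_f4.py \
    --launcher pytorch
```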

sinsin1998 commented on August 30, 2024

> I am also running on multiple GPUs in a single server. … KeyError: 'RANK'
>
> I can run successfully with the command CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 tools/train.py … Maybe you can give it a try.

Hello, I tried to run this code on a 3090, but it is really slow, and I think there may be something wrong with the dataloader. Could you tell me your device and how long it takes to train this model for 20 epochs?
