zhangxiaosong18 / freeanchor

FreeAnchor: Learning to Match Anchors for Visual Object Detection (NeurIPS 2019)

Home Page: https://arxiv.org/abs/1909.02466

License: MIT License

Python 86.44% C++ 5.26% Cuda 8.30%
freeanchor object-detection one-stage pytorch computer-vision neurips-2019

freeanchor's Introduction

FreeAnchor

The Code for "FreeAnchor: Learning to Match Anchors for Visual Object Detection".

This repository is based on maskrcnn-benchmark, and FreeAnchor has also been implemented in mmdetection, thanks @yhcao6 and @hellock.

[Figure: architecture]

New performance on COCO

We added multi-scale testing support and updated experiments. The previous version is in this branch.

Backbone               Iteration  Training scales  Multi-scale testing  AP (minival)  AP (test-dev)  Model
ResNet-50-FPN          90k        800              N                    38.7          38.7           Link
ResNet-101-FPN         90k        800              N                    40.5          40.9           Link
ResNet-101-FPN         180k       [640, 800]       N                    42.7          43.1           Link
ResNet-101-FPN         180k       [480, 960]       N                    43.2          43.9           Link
ResNet-101-FPN         180k       [480, 960]       Y                    44.7          45.2           Link
ResNeXt-64x4d-101-FPN  180k       [640, 800]       N                    44.5          44.9           Link
ResNeXt-64x4d-101-FPN  180k       [480, 960]       N                    45.6          46.0           Link
ResNeXt-64x4d-101-FPN  180k       [480, 960]       Y                    46.8          47.3           Link

Notes:

  • We use 8 GPUs with 2 images per GPU.
  • In multi-scale testing, we use image scales in {480, 640, 800, 960, 1120, 1280}, and max_size is 1.666× the scale.
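
As a rough illustration of the second note, the (min_size, max_size) pair for each test scale could be derived as follows. This is a minimal sketch, not the repository's multi_scale_test.py: the function name and the choice to round to the nearest integer are assumptions.

```python
# Test scales and ratio quoted in the note above.
SCALES = (480, 640, 800, 960, 1120, 1280)
MAX_SIZE_RATIO = 1.666


def multi_scale_sizes(scales=SCALES, ratio=MAX_SIZE_RATIO):
    """Return (min_size, max_size) pairs, one per test scale.

    Rounding to the nearest integer is an assumption of this sketch.
    """
    return [(s, int(round(s * ratio))) for s in scales]


if __name__ == "__main__":
    for min_size, max_size in multi_scale_sizes():
        print(min_size, max_size)
```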

Installation

Check INSTALL.md for installation instructions.

Usage

You will need to download the COCO dataset and configure your own paths to the datasets.

For that, all you need to do is to modify maskrcnn_benchmark/config/paths_catalog.py to point to the location where your dataset is stored.
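
For orientation, a dataset entry in a maskrcnn-benchmark-style catalog is essentially a mapping from a dataset name to an image directory and an annotation file. The sketch below is self-contained and only mimics that idea. The directory layout, the DATA_DIR value, and the resolve helper are hypothetical, so check the actual paths_catalog.py in your checkout for the real keys.

```python
import os

# Hypothetical root; point this at wherever your copy of COCO lives.
DATA_DIR = "/path/to/datasets"

# Hypothetical entries, for illustration only.
DATASETS = {
    "coco_2017_val": {
        "img_dir": "coco/val2017",
        "ann_file": "coco/annotations/instances_val2017.json",
    },
}


def resolve(name, data_dir=DATA_DIR, datasets=DATASETS):
    """Join the data root with the image dir and annotation file of a dataset."""
    entry = datasets[name]
    return {
        "root": os.path.join(data_dir, entry["img_dir"]),
        "ann_file": os.path.join(data_dir, entry["ann_file"]),
    }


if __name__ == "__main__":
    print(resolve("coco_2017_val"))
```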

Config Files

We provide six configuration files in the configs directory.

Config File                              Backbone               Iteration  Training scales
configs/free_anchor_R-50-FPN_1x.yaml     ResNet-50-FPN          90k        800
configs/free_anchor_R-101-FPN_1x.yaml    ResNet-101-FPN         90k        800
configs/free_anchor_R-101-FPN_j2x.yaml   ResNet-101-FPN         180k       [640, 800]
configs/free_anchor_X-101-FPN_j2x.yaml   ResNeXt-64x4d-101-FPN  180k       [640, 800]
configs/free_anchor_R-101-FPN_e2x.yaml   ResNet-101-FPN         180k       [480, 960]
configs/free_anchor_X-101-FPN_e2x.yaml   ResNeXt-64x4d-101-FPN  180k       [480, 960]

Training with 8 GPUs

cd path_to_free_anchor
export NGPUS=8
python -m torch.distributed.launch --nproc_per_node=$NGPUS tools/train_net.py --config-file "path/to/config/file.yaml"
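
If fewer than 8 GPUs are available, the global batch size changes, and the learning rate and schedule usually need to change with it. The sketch below applies the common linear scaling heuristic; it is not taken from this README. The reference values come from this page (16 images per iteration from 8 GPUs × 2 images, 90k iterations, and the learning rate 0.01 that appears in training logs quoted in the issues), while the function name and the decision to apply the heuristic here are assumptions.

```python
def scale_schedule(num_gpus, images_per_gpu,
                   ref_batch=16, ref_lr=0.01, ref_iters=90_000):
    """Linearly scale learning rate and iteration count with batch size."""
    batch = num_gpus * images_per_gpu
    factor = batch / ref_batch
    return {
        "images_per_batch": batch,
        "base_lr": ref_lr * factor,
        "max_iter": int(round(ref_iters / factor)),
    }


if __name__ == "__main__":
    print(scale_schedule(num_gpus=8, images_per_gpu=2))
    print(scale_schedule(num_gpus=1, images_per_gpu=2))
```

Any learning-rate decay steps in the config would presumably need the same stretch factor as max_iter.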

Test on COCO test-dev

cd path_to_free_anchor
python -m torch.distributed.launch --nproc_per_node=$NGPUS tools/test_net.py --config-file "path/to/config/file.yaml" MODEL.WEIGHT "path/to/.pth file" DATASETS.TEST "('coco_test-dev',)"

Multi-scale testing

cd path_to_free_anchor
python -m torch.distributed.launch --nproc_per_node=$NGPUS tools/multi_scale_test.py --config-file "path/to/config/file.yaml" MODEL.WEIGHT "path/to/.pth file" DATASETS.TEST "('coco_test-dev',)"

Evaluate NMS Recall

cd path_to_free_anchor
python -m torch.distributed.launch --nproc_per_node=$NGPUS tools/eval_NR.py --config-file "path/to/config/file.yaml" MODEL.WEIGHT "path/to/.pth file"

Citations

Please consider citing our paper in your publications if the project helps your research.

@inproceedings{zhang2019freeanchor,
  title   =  {{FreeAnchor}: Learning to Match Anchors for Visual Object Detection},
  author  =  {Zhang, Xiaosong and Wan, Fang and Liu, Chang and Ji, Rongrong and Ye, Qixiang},
  booktitle =  {Neural Information Processing Systems},
  year    =  {2019}
}

freeanchor's Issues

Use Group Normalization

I need to use Group Normalization in my task. Is it as simple as setting
MODEL: USE_GN: True

positive bag loss decreases very quickly

Hi, I am trying to transfer the FreeAnchor loss to 3D object detection, but I found that the positive loss decreases very quickly in the first few iterations, from 2.0+ to 0.7+. Have you ever met this problem? Could you please give me some advice? Thanks in advance.

distributed.deprecated

Why do you use torch.distributed.deprecated? All functions have counterparts in torch.distributed.

Question about test?

Hello, I have a question. Multi-scale tests are used on networks like RefineDet and CornerNet. Why are FCOS and your FreeAnchor networks not doing multi-scale testing? Is there any special reason?

AttributeError: nms

from maskrcnn_benchmark import _C
"cannot find any reference _C". How can I deal with this problem? Waiting for your reply. Thanks.

Undefined name 'mask_utils' in segmentation_mask.py

https://github.com/zhangxiaosong18/FreeAnchor/search?q=mask_utils&unscoped_q=mask_utils

flake8 testing of https://github.com/zhangxiaosong18/FreeAnchor on Python 3.7.1

$ flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics

./maskrcnn_benchmark/structures/segmentation_mask.py:126:20: F821 undefined name 'mask_utils'
            rles = mask_utils.frPyObjects(
                   ^
./maskrcnn_benchmark/structures/segmentation_mask.py:129:19: F821 undefined name 'mask_utils'
            rle = mask_utils.merge(rles)
                  ^
./maskrcnn_benchmark/structures/segmentation_mask.py:130:20: F821 undefined name 'mask_utils'
            mask = mask_utils.decode(rle)
                   ^
3     F821 undefined name 'mask_utils'
3

E901, E999, F821, F822, and F823 are the "showstopper" flake8 issues that can halt the runtime with a SyntaxError, NameError, etc. These 5 are different from most other flake8 issues, which are merely "style violations" -- useful for readability, but they do not affect runtime safety.

  • F821: undefined name name
  • F822: undefined name name in __all__
  • F823: local variable name referenced before assignment
  • E901: SyntaxError or IndentationError
  • E999: SyntaxError -- failed to compile a file into an Abstract Syntax Tree
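
To see why F821 is a runtime hazard rather than a style issue, consider this minimal sketch. The function names are hypothetical and the module is deliberately missing an import. In segmentation_mask.py the name is presumably meant to come from an import such as "import pycocotools.mask as mask_utils", but treat that as an assumption to verify against the upstream code.

```python
def decode_mask(rle):
    # 'mask_utils' is never imported in this module, which is exactly
    # what flake8 reports as F821.
    return mask_utils.decode(rle)


def try_decode():
    """Call decode_mask and return the NameError message, if any."""
    try:
        decode_mask({"counts": b"", "size": [0, 0]})
    except NameError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    # Defining decode_mask succeeds; the failure only appears when it is called.
    print(try_decode())
```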

CPU usage goes too high when running demo/webcam.py

Running demo/webcam.py costs too much CPU. It occupies almost all CPU cores, leaving no CPU resources for other programs. However, this doesn't happen during training or testing. I wonder why it uses so much CPU and how I can fix this problem.

I modified the code of webcam.py to take a single image as input.

Performance gap!

Hi, thanks for making your code public. Training with this code, I only get 38.2 mAP on the minival dataset. Is this a normal fluctuation?

formulation error

All the sum functions in Eq. 2 should be products, and the Cij is not the same as in Eq. 1.

Please correct me if I am wrong. Thanks.

Problem about calculation of loss

Dear author, I am really puzzled about the loss calculation in the following code segment from free_anchor_loss.py. Could you explain it to me, please?

Thanks in advance!

with torch.set_grad_enabled(False):
    box_localization = self.box_coder.decode(box_regression_, anchors_.bbox)
    object_box_iou = boxlist_iou(
        targets_,
        BoxList(box_localization, anchors_.size, mode='xyxy')
    )
    H = object_box_iou.max(dim=1, keepdim=True).values.clamp(
        min=self.bbox_threshold + 1e-12)
    object_box_prob = (
            (object_box_iou - self.bbox_threshold) / (H - self.bbox_threshold)
    ).clamp(min=0, max=1)

    indices = torch.stack(
        [torch.arange(len(labels_)).type_as(labels_), labels_], dim=0)

    """
    to implement image_box_iou = torch.sparse.max(
                      torch.sparse_coo_tensor(indices, object_box_iou), dim=0
                 )
    """
    # start
    indices = torch.nonzero(torch.sparse.sum(
        torch.sparse_coo_tensor(indices, object_box_prob), dim=0
    ).to_dense()).t_()

    if indices.numel() == 0:
        image_box_prob = torch.zeros(anchors_.bbox.size(0),
                                     self.num_classes).type_as(object_box_prob)
    else:
        nonzero_box_prob = torch.where(
            (labels_.unsqueeze(dim=-1) == indices[0]),
            object_box_prob[:, indices[1]],
            torch.tensor([0]).type_as(object_box_prob)
        ).max(dim=0).values

        image_box_prob = torch.sparse_coo_tensor(
            indices.flip([0]), nonzero_box_prob,
            size=(anchors_.bbox.size(0), self.num_classes)
        ).to_dense()
    # end
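
For readers trying to follow the snippet, here is a dependency-free sketch of what its two stages appear to compute, using nested lists instead of tensors. The first function is the clamped linear mapping from IoU to a matching probability for one object. The second takes, for every anchor and class, the maximum of those probabilities over the objects of that class. The threshold of 0.6 and all function names are assumptions for illustration, and this is an interpretation of the code, not the author's explanation.

```python
def object_box_prob(ious, threshold=0.6, eps=1e-12):
    """Map one object's IoUs with all anchors to [0, 1].

    IoUs at or below `threshold` map to 0, the object's best IoU maps to 1,
    and values in between are interpolated linearly.
    """
    best = max(max(ious), threshold + eps)
    return [min(max((iou - threshold) / (best - threshold), 0.0), 1.0)
            for iou in ious]


def image_box_prob(object_probs, labels, num_classes):
    """Per-anchor, per-class maximum over the objects sharing that class.

    object_probs: one list of per-anchor probabilities per object.
    labels: class index of each object.
    """
    num_anchors = len(object_probs[0]) if object_probs else 0
    out = [[0.0] * num_classes for _ in range(num_anchors)]
    for probs, label in zip(object_probs, labels):
        for anchor, prob in enumerate(probs):
            out[anchor][label] = max(out[anchor][label], prob)
    return out


if __name__ == "__main__":
    probs_a = object_box_prob([0.5, 0.7, 0.8])
    probs_b = object_box_prob([0.9, 0.65, 0.1])
    print(image_box_prob([probs_a, probs_b], labels=[1, 1], num_classes=3))
```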

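Some questions about the training loss

Hello, thank you very much for sharing. While reading the paper, I did not fully understand the training loss and would like to ask:
Why can parts of the loss be replaced by Mean-max(X) and FL_(p)? I don't quite follow this part. Could you explain the derivation? Thank you!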
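
As background for this question: the paper defines Mean-max(X) as the sum of x_j / (1 - x_j) divided by the sum of 1 / (1 - x_j), over all x_j in X. The small sketch below evaluates that formula numerically to show its two regimes. It only illustrates the definition and is not a derivation; the function name and the epsilon guard are assumptions.

```python
def mean_max(xs, eps=1e-12):
    """Weighted average of xs with weights 1 / (1 - x)."""
    weights = [1.0 / max(1.0 - x, eps) for x in xs]
    return sum(w * x for w, x in zip(weights, xs)) / sum(weights)


if __name__ == "__main__":
    # All values small and equal: behaves like the mean.
    print(mean_max([0.1, 0.1, 0.1]))
    # One value close to 1: dominated by that value, like the max.
    print(mean_max([0.01, 0.99]))
```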

import _c error

I followed the INSTALL.md process.
But I got an error like the following.

  File "tools/train_net.py", line 18, in <module>
    from maskrcnn_benchmark.engine.inference import inference
  File "/workspace/FreeAnchor/maskrcnn_benchmark/engine/inference.py", line 20, in <module>
    from maskrcnn_benchmark.structures.boxlist_ops import boxlist_iou
  File "/workspace/FreeAnchor/maskrcnn_benchmark/structures/boxlist_ops.py", line 6, in <module>
    from maskrcnn_benchmark.layers import nms as _box_nms
  File "/workspace/FreeAnchor/maskrcnn_benchmark/layers/__init__.py", line 8, in <module>
    from .nms import nms
  File "/workspace/FreeAnchor/maskrcnn_benchmark/layers/nms.py", line 3, in <module>
    from maskrcnn_benchmark import _C
ImportError: /workspace/FreeAnchor/maskrcnn_benchmark/_C.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZN2at19UndefinedTensorImpl10_singletonE

Package Version Location


backcall 0.1.0
certifi 2019.11.28
cffi 1.13.2
cycler 0.10.0
Cython 0.29.14
decorator 4.4.1
ipython 7.12.0
ipython-genutils 0.2.0
jedi 0.16.0
kiwisolver 1.1.0
maskrcnn-benchmark 0.1 /workspace/FreeAnchor
matplotlib 3.1.3
mkl-fft 1.0.15
mkl-random 1.1.0
mkl-service 2.3.0
numpy 1.18.1
olefile 0.46
parso 0.6.0
pexpect 4.8.0
pickleshare 0.7.5
Pillow 4.1.1
pip 20.0.2
prompt-toolkit 3.0.3
ptyprocess 0.6.0
pycocotools 2.0.0
pycparser 2.19
Pygments 2.5.2
pyparsing 2.4.6
python-dateutil 2.8.1
PyYAML 5.3
setuptools 45.1.0.post20200127
six 1.14.0
torch 1.1.0
torchvision 0.2.1
tqdm 4.42.1
traitlets 4.3.3
wcwidth 0.1.8
wheel 0.34.2
yacs 0.1.6

How do I solve this problem?
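
An "undefined symbol" error when importing a compiled extension often means the .so file was built against a different PyTorch version than the one currently installed; whether that is the cause here is an assumption. Under that assumption, removing stale build outputs and rebuilding with python setup.py build develop (the build command quoted elsewhere on this page) is a reasonable first step. The helper below is a hypothetical sketch that only deletes the build directory and any _C*.so files under the package directory.

```python
import shutil
from pathlib import Path


def clean_extension_artifacts(repo_root, package="maskrcnn_benchmark"):
    """Delete stale build outputs and return the paths that were removed."""
    root = Path(repo_root)
    removed = []

    build_dir = root / "build"
    if build_dir.is_dir():
        shutil.rmtree(build_dir)
        removed.append(build_dir)

    # Compiled extension files such as _C.cpython-37m-x86_64-linux-gnu.so
    package_dir = root / package
    if package_dir.is_dir():
        for so_file in sorted(package_dir.glob("_C*.so")):
            so_file.unlink()
            removed.append(so_file)

    return removed
```

After cleaning, the extension has to be rebuilt in the same environment that will run the training scripts.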

Difference between code and paper

When [equation not recovered]

so [equation not recovered]

then [equation not recovered]

but in your code https://github.com/zhangxiaosong18/FreeAnchor/blob/master/maskrcnn_benchmark/modeling/rpn/free_anchor_loss.py#L161,
you just use [symbol not recovered] (matched_cls_prob in your code) as [symbol not recovered].
That means you ignore the other predicted classes that do not match the target class, and I think this differs from retinanet_cls_loss defined in https://github.com/zhangxiaosong18/FreeAnchor/blob/master/maskrcnn_benchmark/modeling/rpn/retinanet_loss.py#L142.

I tried to rewrite the code calculating matched_cls_prob as below:

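# One-hot encode each object's target class: [num_objects, num_classes]
labels_mul = torch.zeros([len(labels_), self.num_classes])
for i in range(len(labels_)):
    labels_mul[i, labels_[i]] = 1

# Repeat the one-hot target for every candidate anchor of each object
labels_mul = labels_mul.unsqueeze(1).repeat(1, self.pre_anchor_topk, 1)

# Sum the per-class BCE, then map the summed loss back to a probability
loss_mul_class = nn.BCELoss(reduction="none")(cls_prob_[matched], labels_mul).sum(dim=-1)
matched_cls_prob = (-loss_mul_class).exp()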

Did I get it wrong ? @zhangxiaosong18
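
To make the comparison concrete: summing the binary cross-entropy over classes and exponentiating the negative gives the joint probability, that is, the product over classes of p_c if c is the target class and (1 - p_c) otherwise. Selecting only the target-class score gives p_target alone. The dependency-free sketch below checks this numerically for a single anchor; the numbers and function names are made up for illustration.

```python
import math


def joint_cls_prob(probs, target):
    """exp(-sum of per-class BCE) against a one-hot target."""
    bce = 0.0
    for c, p in enumerate(probs):
        y = 1.0 if c == target else 0.0
        bce += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return math.exp(-bce)


def target_only_prob(probs, target):
    """Score of the target class alone."""
    return probs[target]


if __name__ == "__main__":
    probs = [0.9, 0.2, 0.1]  # made-up sigmoid outputs for 3 classes
    print(joint_cls_prob(probs, 0))    # equals 0.9 * (1 - 0.2) * (1 - 0.1)
    print(target_only_prob(probs, 0))  # the target score by itself
```

The two quantities differ by the factor contributed by the non-target classes, which is the gap the question is pointing at.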

positive_loss is much larger than negative_loss

Train log:
2019-10-11 16:29:21,673 maskrcnn_benchmark.trainer INFO: eta: 1 day, 0:40:30 iter: 260 loss: 3.8217 (3.8392) negative_loss: 0.0326 (0.0354) positive_loss: 3.7731 (3.8038) time: 1.3759 (1.4870) data: 0.0050 (0.0068) lr: 0.000680 max mem: 7173
2019-10-11 16:29:48,522 maskrcnn_benchmark.trainer INFO: eta: 1 day, 0:29:44 iter: 280 loss: 3.7343 (3.8322) negative_loss: 0.0513 (0.0364) positive_loss: 3.6920 (3.7958) time: 1.2558 (1.4766) data: 0.0049 (0.0067) lr: 0.000707 max mem: 7173
2019-10-11 16:30:17,056 maskrcnn_benchmark.trainer INFO: eta: 1 day, 0:25:56 iter: 300 loss: 3.5909 (3.8169) negative_loss: 0.0517 (0.0395) positive_loss: 3.5172 (3.7775) time: 1.1965 (1.4733) data: 0.0047 (0.0066) lr: 0.000733 max mem: 7173
Is this normal?

RuntimeError when training

I only changed the config file (free_anchor_R-50-FPN_test.txt) and got the following error:

2020-01-16 15:49:05,738 maskrcnn_benchmark.trainer INFO: eta: 3:46:57 iter: 244400 loss: 1.7938 (1.8977) loss_retina_positive: 1.6451 (1.7404) loss_retina_negative: 0.1402 (0.1573) time: 0.1097 (0.1178) data: 0.0042 (0.0045) lr: 0.010000 max mem: 1404
2020-01-16 15:49:08,161 maskrcnn_benchmark.trainer INFO: eta: 3:46:55 iter: 244420 loss: 1.7646 (1.8977) loss_retina_positive: 1.6248 (1.7404) loss_retina_negative: 0.1239 (0.1573) time: 0.1109 (0.1178) data: 0.0041 (0.0045) lr: 0.010000 max mem: 1404
2020-01-16 15:49:10,560 maskrcnn_benchmark.trainer INFO: eta: 3:46:52 iter: 244440 loss: 1.8001 (1.8977) loss_retina_positive: 1.6412 (1.7404) loss_retina_negative: 0.1554 (0.1573) time: 0.1126 (0.1178) data: 0.0040 (0.0045) lr: 0.010000 max mem: 1404
2020-01-16 15:49:12,817 maskrcnn_benchmark.trainer INFO: eta: 3:46:50 iter: 244460 loss: 1.7907 (1.8977) loss_retina_positive: 1.6191 (1.7404) loss_retina_negative: 0.1470 (0.1573) time: 0.1076 (0.1178) data: 0.0037 (0.0045) lr: 0.010000 max mem: 1404
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1556653215914/work/aten/src/THC/THCCachingHostAllocator.cpp line=265 error=59 : device-side assert triggered
/opt/conda/conda-bld/pytorch_1556653215914/work/aten/src/THCUNN/BCECriterion.cu:57: void bce_updateOutput_no_reduce_functor<Dtype, Acctype>::operator()(const Dtype *, const Dtype *, Dtype *) [with Dtype = float, Acctype = float]: block: [2887,0,0], thread: [16,0,0] Assertion *input >= 0. && *input <= 1. failed.
Traceback (most recent call last):
File "tools/train_net.py", line 171, in
main()
File "tools/train_net.py", line 164, in main
model = train(cfg, args.local_rank, args.distributed)
File "tools/train_net.py", line 73, in train
arguments,
File "/home/zz/work/FreeAnchor/maskrcnn_benchmark/engine/trainer.py", line 70, in do_train
loss_dict_reduced = reduce_loss_dict(loss_dict)
File "/home/zz/work/FreeAnchor/maskrcnn_benchmark/engine/trainer.py", line 28, in reduce_loss_dict
all_losses = torch.stack(all_losses, dim=0)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1556653215914/work/aten/src/THC/THCCachingHostAllocator.cpp:265
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered (insert_events at /opt/conda/conda-bld/pytorch_1556653215914/work/c10/cuda/CUDACachingAllocator.cpp:564)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7fcb2ed3fdc5 in /home/zz/anaconda3/envs/fa/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: + 0x14792 (0x7fcb2bc1c792 in /home/zz/anaconda3/envs/fa/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x50 (0x7fcb2ed2f640 in /home/zz/anaconda3/envs/fa/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: + 0x3067fb (0x7fcb2c33c7fb in /home/zz/anaconda3/envs/fa/lib/python3.7/site-packages/torch/lib/libtorch.so.1)
frame #4: + 0x14019b (0x7fcb54b2019b in /home/zz/anaconda3/envs/fa/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0x3bfc84 (0x7fcb54d9fc84 in /home/zz/anaconda3/envs/fa/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: + 0x3bfcd1 (0x7fcb54d9fcd1 in /home/zz/anaconda3/envs/fa/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #7: + 0x19dfce (0x56446760afce in /home/zz/anaconda3/envs/fa/bin/python)
frame #8: + 0x113a6b (0x564467580a6b in /home/zz/anaconda3/envs/fa/bin/python)
frame #9: + 0x103948 (0x564467570948 in /home/zz/anaconda3/envs/fa/bin/python)
frame #10: + 0x114267 (0x564467581267 in /home/zz/anaconda3/envs/fa/bin/python)
frame #11: + 0x11427d (0x56446758127d in /home/zz/anaconda3/envs/fa/bin/python)
frame #12: + 0x11427d (0x56446758127d in /home/zz/anaconda3/envs/fa/bin/python)
frame #13: + 0x11427d (0x56446758127d in /home/zz/anaconda3/envs/fa/bin/python)
frame #14: PyDict_SetItem + 0x502 (0x5644675cc602 in /home/zz/anaconda3/envs/fa/bin/python)
frame #15: PyDict_SetItemString + 0x4f (0x5644675cd0cf in /home/zz/anaconda3/envs/fa/bin/python)
frame #16: PyImport_Cleanup + 0x9e (0x56446760c91e in /home/zz/anaconda3/envs/fa/bin/python)
frame #17: Py_FinalizeEx + 0x67 (0x564467682367 in /home/zz/anaconda3/envs/fa/bin/python)
frame #18: + 0x227d93 (0x564467694d93 in /home/zz/anaconda3/envs/fa/bin/python)
frame #19: _Py_UnixMain + 0x3c (0x5644676950bc in /home/zz/anaconda3/envs/fa/bin/python)
frame #20: __libc_start_main + 0xe7 (0x7fcb651ccb97 in /lib/x86_64-linux-gnu/libc.so.6)
frame #21: + 0x1d0990 (0x56446763d990 in /home/zz/anaconda3/envs/fa/bin/python)

free_anchor_R-50-FPN_test.txt

Any step-by-step results available?

Thanks for sharing this great work :)

I just wonder whether you have any step-by-step experiments for building the final loss formulation.

I tried to decompose the loss to see the effectiveness of each term, but it turns out that the loss works as a whole and is sensitive to changes in some parts in my case.

Really appreciate your time.

import _c error

Traceback (most recent call last):
  File "tools/multi_scale_test.py", line 7, in <module>
    from maskrcnn_benchmark.engine.inference import inference
  File "/home/hansol/PycharmProjects/FreeAnchor/maskrcnn_benchmark/engine/inference.py", line 20, in <module>
    from maskrcnn_benchmark.structures.boxlist_ops import boxlist_iou
  File "/home/hansol/PycharmProjects/FreeAnchor/maskrcnn_benchmark/structures/boxlist_ops.py", line 6, in <module>
    from maskrcnn_benchmark.layers import nms as _box_nms
  File "/home/hansol/PycharmProjects/FreeAnchor/maskrcnn_benchmark/layers/__init__.py", line 8, in <module>
    from .nms import nms
  File "/home/hansol/PycharmProjects/FreeAnchor/maskrcnn_benchmark/layers/nms.py", line 3, in <module>
    from maskrcnn_benchmark import _C
ImportError: /home/hansol/PycharmProjects/FreeAnchor/maskrcnn_benchmark/_C.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZN6caffe28TypeMeta21_typeMetaDataInstanceIN3c108BFloat16EEEPKNS_6detail12TypeMetaDataEv
Traceback (most recent call last):
  File "/home/hansol/anaconda3/envs/free/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/hansol/anaconda3/envs/free/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/hansol/anaconda3/envs/free/lib/python3.7/site-packages/torch/distributed/launch.py", line 235, in <module>
    main()
  File "/home/hansol/anaconda3/envs/free/lib/python3.7/site-packages/torch/distributed/launch.py", line 231, in main
    cmd=process.args)
subprocess.CalledProcessError: Command '['/home/hansol/anaconda3/envs/free/bin/python', '-u', 'tools/multi_scale_test.py', '--local_rank=0', '--config-file', 'config/free_anchor_X-101-FPN_e2x.yaml', 'MODEL.WEIGHT', 'free_anchor_X-64x4d-101-FPN_e2x.pth', 'DATASETS.TEST', 'coco_2017_val']' returned non-zero exit status 1.

I followed the INSTALL.md instructions but got the above errors.
How can I solve this?

about test process

Hi, thank you for your excellent work. I have a question about testing. You only revised the loss used in training, so is nothing changed at test time compared with RetinaNet?

loss increases when training

Hi, I am trying to transfer the FreeAnchor loss to 3D object detection. But during training the positive loss increases gradually while the negative loss decreases. Have you ever met this kind of problem?

[AP] Got unexpected AP results

I downloaded some pretrained model .pth files using the links.
But when I test on COCO test-dev, all the AP results I get are -1.00...

command:
$ export NGPUS=4

$ python -m torch.distributed.launch --nproc_per_node=$NGPUS tools/test_net.py --config-file configs/free_anchor_R-50-FPN_1x.yaml MODEL.WEIGHT Pretrained_Models/free_anchor_R-50-FPN_1x.pth DATASETS.TEST "('coco_test-dev',)"

Why? How can I solve this problem? Thanks.
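
One likely explanation, stated here as an assumption about this report: COCO does not publish test-dev annotations, so a local evaluation has no ground truth to compare against and reports -1 for every metric. Test-dev numbers come from submitting a results file to the COCO evaluation server. The sketch below shows the general shape of such a results file. The helper names are hypothetical, and the width/height convention (x2 - x1, without a +1 offset) is an assumption.

```python
import json


def xyxy_to_xywh(box):
    """Convert [x1, y1, x2, y2] to COCO's [x, y, width, height]."""
    x1, y1, x2, y2 = box
    return [x1, y1, x2 - x1, y2 - y1]


def to_coco_results(detections):
    """detections: iterable of (image_id, category_id, xyxy_box, score)."""
    return [
        {
            "image_id": image_id,
            "category_id": category_id,
            "bbox": xyxy_to_xywh(box),
            "score": float(score),
        }
        for image_id, category_id, box, score in detections
    ]


if __name__ == "__main__":
    results = to_coco_results([(42, 1, [10.0, 20.0, 110.0, 220.0], 0.9)])
    print(json.dumps(results))
```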

Compile Failed on Windows 10

I tried to run the command python setup.py build develop, but these errors were raised:


D:/Artificial Intelligence/Object Detection/FreeAnchor-master/FreeAnchor-master/maskrcnn_benchmark/csrc/cuda/ROIAlign_cuda.cu(275): error: no instance of function template "THCCeilDiv" matches the argument list
            argument types are: (long long, long)

D:/Artificial Intelligence/Object Detection/FreeAnchor-master/FreeAnchor-master/maskrcnn_benchmark/csrc/cuda/ROIAlign_cuda.cu(275): error: no instance of overloaded function "std::min" matches the argument list
            argument types are: (<error-type>, long)

D:/Artificial Intelligence/Object Detection/FreeAnchor-master/FreeAnchor-master/maskrcnn_benchmark/csrc/cuda/ROIAlign_cuda.cu(320): error: no instance of function template "THCCeilDiv" matches the argument list
            argument types are: (int64_t, long)

D:/Artificial Intelligence/Object Detection/FreeAnchor-master/FreeAnchor-master/maskrcnn_benchmark/csrc/cuda/ROIAlign_cuda.cu(320): error: no instance of overloaded function "std::min" matches the argument list
            argument types are: (<error-type>, long)

4 errors detected in the compilation of "C:/Users/127051/AppData/Local/Temp/tmpxft_000021a0_00000000-10_ROIAlign_cuda.cpp1.ii".
error: command 'C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v10.1\\bin\\nvcc.exe' failed with exit status 1
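
The argument types in these messages, (long long, long), hint at a likely cause, offered here as an assumption: a function template that expects both arguments to have the same type cannot deduce that type when one argument is int64_t and the other is long. On typical 64-bit Linux builds int64_t is long, so the call compiles, whereas on Windows long is 32 bits wide and int64_t is long long. The short Python check below prints the sizes on the current platform using the standard ctypes module.

```python
import ctypes
import platform


def c_integer_sizes():
    """Sizes in bytes of C 'long' and 'long long' on this platform."""
    return {
        "long": ctypes.sizeof(ctypes.c_long),
        "long long": ctypes.sizeof(ctypes.c_longlong),
    }


if __name__ == "__main__":
    # On 64-bit Linux both are typically 8 bytes; on Windows 'long' is 4.
    print(platform.system(), c_integer_sizes())
```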

RuntimeError: CUDA error: an illegal memory access was encountered

I met this problem with a 2080 Ti. The error occurs after several epochs. Changing the images per GPU from 2 to 1 and reducing the number of gt boxes per image didn't help.
Train log:

out of memory
invalid argument
an illegal memory access was encountered
an illegal memory access was encountered
Traceback (most recent call last):
...
overlaps_th = torch.tensor(overlaps).to(boxlist1.bbox.device) #[N, M]
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:569)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7fb1e9515813 in /home/fw/Softwares/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)

This is my environment information:

OS: Ubuntu 16.04 LTS 64-bit
Command: conda install pytorch torchvision cudatoolkit=10.1 -c pytorch
GPU: 2080ti
Driver Version: 418.67
Python Version: 3.7
cuda Version: 10.1
cudnn Version: 7
pytorch Version: torch-1.1.0, torchvision-0.2.0

I tried setting CUDA_LAUNCH_BLOCKING to 1 and met the same problem. How can I solve it?

RuntimeError: CUDA error: device-side assert triggered

Hi, I'm trying to run CUDA_VISIBLE_DEVICES=2 python -m torch.distributed.launch --nproc_per_node=1 tools/train_net.py --config-file configs/free_anchor_R-50-FPN_1x.yaml on a single GeForce 1080 Ti GPU. It ran for 760 iterations and then I got the following error:

...
2019-09-14 09:19:42,314 maskrcnn_benchmark.trainer INFO: eta: 8:44:11  iter: 740  loss: 3.6038 (3.8930)  loss_retina_positive: 3.4092 (3.6858)  loss_retina_negative: 0.1473 (0.2072)  time: 0.3526 (0.3524)  data: 0.0130 (0.0210)  lr: 0.010000  max mem: 4004
2019-09-14 09:19:49,227 maskrcnn_benchmark.trainer INFO: eta: 8:43:48  iter: 760  loss: 3.6373 (3.8917)  loss_retina_positive: 3.4743 (3.6837)  loss_retina_negative: 0.1995 (0.2080)  time: 0.3483 (0.3522)  data: 0.0153 (0.0209)  lr: 0.010000  max mem: 4004
/opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/THCUNN/BCECriterion.cu:57: void bce_updateOutput_no_reduce_functor<Dtype, Acctype>::operator()(const Dtype *, const Dtype *, Dtype *) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [0,0,0] Assertion `*input >= 0. && *input <= 1.` failed.
/opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/THCUNN/BCECriterion.cu:57: void bce_updateOutput_no_reduce_functor<Dtype, Acctype>::operator()(const Dtype *, const Dtype *, Dtype *) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [1,0,0] Assertion `*input >= 0. && *input <= 1.` failed.
/opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/THCUNN/BCECriterion.cu:57: void bce_updateOutput_no_reduce_functor<Dtype, Acctype>::operator()(const Dtype *, const Dtype *, Dtype *) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [2,0,0] Assertion `*input >= 0. && *input <= 1.` failed.
/opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/THCUNN/BCECriterion.cu:57: void bce_updateOutput_no_reduce_functor<Dtype, Acctype>::operator()(const Dtype *, const Dtype *, Dtype *) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [3,0,0] Assertion `*input >= 0. && *input <= 1.` failed.
/opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/THCUNN/BCECriterion.cu:57: void bce_updateOutput_no_reduce_functor<Dtype, Acctype>::operator()(const Dtype *, const Dtype *, Dtype *) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [4,0,0] Assertion `*input >= 0. && *input <= 1.` failed.
/opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/THCUNN/BCECriterion.cu:57: void bce_updateOutput_no_reduce_functor<Dtype, Acctype>::operator()(const Dtype *, const Dtype *, Dtype *) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [5,0,0] Assertion `*input >= 0. && *input <= 1.` failed.
/opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/THCUNN/BCECriterion.cu:57: void bce_updateOutput_no_reduce_functor<Dtype, Acctype>::operator()(const Dtype *, const Dtype *, Dtype *) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [6,0,0] Assertion `*input >= 0. && *input <= 1.` failed.
/opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/THCUNN/BCECriterion.cu:57: void bce_updateOutput_no_reduce_functor<Dtype, Acctype>::operator()(const Dtype *, const Dtype *, Dtype *) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [7,0,0] Assertion `*input >= 0. && *input <= 1.` failed.
/opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/THCUNN/BCECriterion.cu:57: void bce_updateOutput_no_reduce_functor<Dtype, Acctype>::operator()(const Dtype *, const Dtype *, Dtype *) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [8,0,0] Assertion `*input >= 0. && *input <= 1.` failed.
/opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/THCUNN/BCECriterion.cu:57: void bce_updateOutput_no_reduce_functor<Dtype, Acctype>::operator()(const Dtype *, const Dtype *, Dtype *) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [9,0,0] Assertion `*input >= 0. && *input <= 1.` failed.
/opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/THCUNN/BCECriterion.cu:57: void bce_updateOutput_no_reduce_functor<Dtype, Acctype>::operator()(const Dtype *, const Dtype *, Dtype *) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [10,0,0] Assertion `*input >= 0. && *input <= 1.` failed.
/opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/THCUNN/BCECriterion.cu:57: void bce_updateOutput_no_reduce_functor<Dtype, Acctype>::operator()(const Dtype *, const Dtype *, Dtype *) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [11,0,0] Assertion `*input >= 0. && *input <= 1.` failed.
/opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/THCUNN/BCECriterion.cu:57: void bce_updateOutput_no_reduce_functor<Dtype, Acctype>::operator()(const Dtype *, const Dtype *, Dtype *) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [12,0,0] Assertion `*input >= 0. && *input <= 1.` failed.
/opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/THCUNN/BCECriterion.cu:57: void bce_updateOutput_no_reduce_functor<Dtype, Acctype>::operator()(const Dtype *, const Dtype *, Dtype *) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [13,0,0] Assertion `*input >= 0. && *input <= 1.` failed.
/opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/THCUNN/BCECriterion.cu:57: void bce_updateOutput_no_reduce_functor<Dtype, Acctype>::operator()(const Dtype *, const Dtype *, Dtype *) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [14,0,0] Assertion `*input >= 0. && *input <= 1.` failed.
/opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/THCUNN/BCECriterion.cu:57: void bce_updateOutput_no_reduce_functor<Dtype, Acctype>::operator()(const Dtype *, const Dtype *, Dtype *) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [15,0,0] Assertion `*input >= 0. && *input <= 1.` failed.
Traceback (most recent call last):
  File "tools/train_net.py", line 171, in <module>
    main()
  File "tools/train_net.py", line 164, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 73, in train
    arguments,
  File "/home/zlq/code/FreeAnchor/maskrcnn_benchmark/engine/trainer.py", line 74, in do_train
    loss_dict = model(images, targets)
  File "/home/zlq/anaconda3/envs/torch2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zlq/code/FreeAnchor/maskrcnn_benchmark/modeling/detector/retinanet.py", line 62, in forward
    (anchors, detections), detector_losses = self.rpn(images, rpn_features, targets)
  File "/home/zlq/anaconda3/envs/torch2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zlq/code/FreeAnchor/maskrcnn_benchmark/modeling/rpn/retinanet.py", line 152, in forward
    return self._forward_train(anchors, box_cls, box_regression, targets)
  File "/home/zlq/code/FreeAnchor/maskrcnn_benchmark/modeling/rpn/retinanet.py", line 159, in _forward_train
    anchors, box_cls, box_regression, targets
  File "/home/zlq/code/FreeAnchor/maskrcnn_benchmark/modeling/rpn/free_anchor_loss.py", line 114, in __call__
    (object_box_iou - self.bbox_threshold) / (H - self.bbox_threshold)
RuntimeError: CUDA error: device-side assert triggered
Traceback (most recent call last):
  File "/home/zlq/anaconda3/envs/torch2/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/zlq/anaconda3/envs/torch2/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/zlq/anaconda3/envs/torch2/lib/python3.6/site-packages/torch/distributed/launch.py", line 235, in <module>
    main()
  File "/home/zlq/anaconda3/envs/torch2/lib/python3.6/site-packages/torch/distributed/launch.py", line 231, in main
    cmd=process.args)
subprocess.CalledProcessError: Command '['/home/zlq/anaconda3/envs/torch2/bin/python', '-u', 'tools/train_net.py', '--local_rank=0', '--config-file', 'configs/free_anchor_R-50-FPN_1x.yaml']' returned non-zero exit status 1.

Has anyone met this error? Please give me some suggestions on solving it. Thanks a lot!
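
The assertion in these logs comes from the binary cross-entropy kernel, which requires every input to lie in [0, 1]; values outside that range, including NaN, trigger it. A possible way to narrow the problem down is to validate the probabilities right before the loss call. The sketch below does this for a plain Python sequence. The function name is hypothetical, and in real code the same check would be applied to the tensor passed to the loss.

```python
import math


def find_invalid_probs(values):
    """Return (index, value) pairs that are NaN or outside [0, 1]."""
    bad = []
    for i, v in enumerate(values):
        if math.isnan(v) or v < 0.0 or v > 1.0:
            bad.append((i, v))
    return bad


if __name__ == "__main__":
    print(find_invalid_probs([0.0, 0.5, 1.0000001, float("nan"), -0.1]))
```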

dimension specified as 0 but tensor has no dimensions

the issue is as follows:
Traceback (most recent call last):
File "/usr/share/pycharm/helpers/pydev/pydevd.py", line 1758, in
main()
File "/usr/share/pycharm/helpers/pydev/pydevd.py", line 1752, in main
globals = debugger.run(setup['file'], None, None, is_module)
File "/usr/share/pycharm/helpers/pydev/pydevd.py", line 1147, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "/usr/share/pycharm/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/media/zy/Documents/FreeAnchor/tools/train_net.py", line 171, in
main()
File "/media/zy/Documents/FreeAnchor/tools/train_net.py", line 164, in main
model = train(cfg, args.local_rank, args.distributed)
File "/media/zy/Documents/FreeAnchor/tools/train_net.py", line 73, in train
arguments,
File "/media/zy/Documents/FreeAnchor/maskrcnn_benchmark/engine/trainer.py", line 56, in do_train
for iteration, (images, targets, _) in enumerate(data_loader, start_iter):
File "/media/zy/Software/miniconda/envs/free_anchor/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 582, in next
return self._process_next_batch(batch)
File "/media/zy/Software/miniconda/envs/free_anchor/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 608, in _process_next_batch
raise batch.exc_type(batch.exc_msg)
IndexError: Traceback (most recent call last):
File "/media/zy/Software/miniconda/envs/free_anchor/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
samples = collate_fn([dataset[i] for i in batch_indices])
File "/media/zy/Software/miniconda/envs/free_anchor/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 99, in
samples = collate_fn([dataset[i] for i in batch_indices])
File "/media/zy/Documents/FreeAnchor/maskrcnn_benchmark/data/datasets/coco.py", line 58, in getitem
img, target = self.transforms(img, target)
File "/media/zy/Documents/FreeAnchor/maskrcnn_benchmark/data/transforms/transforms.py", line 15, in call
image, target = t(image, target)
File "/media/zy/Documents/FreeAnchor/maskrcnn_benchmark/data/transforms/transforms.py", line 70, in call
image, target = resizer(image, target)
File "/media/zy/Documents/FreeAnchor/maskrcnn_benchmark/data/transforms/transforms.py", line 58, in call
target = target.resize(image.size)
File "/media/zy/Documents/FreeAnchor/maskrcnn_benchmark/structures/bounding_box.py", line 124, in resize
v = v.resize(size, *args, **kwargs)
File "/media/zy/Documents/FreeAnchor/maskrcnn_benchmark/structures/segmentation_mask.py", line 184, in resize
scaled.append(polygon.resize(size, *args, **kwargs))
File "/media/zy/Documents/FreeAnchor/maskrcnn_benchmark/structures/segmentation_mask.py", line 117, in resize
p[0::2] *= ratio_w
IndexError: dimension specified as 0 but tensor has no dimensions

The version of PyTorch is 1.1.0.
How can I fix the problem?
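
The failing line, p[0::2] *= ratio_w, slices a polygon's coordinate tensor, so the error suggests that at least one polygon reached it without any dimension, for example because an annotation's segmentation entry is a bare number rather than a list of coordinates. This is a guess from the traceback, not a confirmed diagnosis. Under that assumption, one could drop malformed polygons when loading annotations. The helper below is hypothetical; it keeps only flat coordinate lists with an even length of at least 6 (three points).

```python
def valid_polygons(segmentation):
    """Keep only polygons given as flat [x1, y1, x2, y2, ...] lists."""
    kept = []
    for poly in segmentation:
        if not isinstance(poly, (list, tuple)):
            continue  # e.g. a bare number instead of a coordinate list
        if len(poly) < 6 or len(poly) % 2 != 0:
            continue  # fewer than three points, or an unpaired coordinate
        kept.append(list(poly))
    return kept


if __name__ == "__main__":
    print(valid_polygons([[0, 0, 1, 0, 1, 1], [], 5.0, [0, 0, 1, 1]]))
```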
