megvii-model / yolof

License: MIT License

Python 89.90% C++ 3.65% Cuda 6.22% Dockerfile 0.05% Shell 0.18%

yolof's Issues

paper

Hello, for large objects there is a 9.3% difference, which is not small compared to the others.

Weight Initialization method in DilatedEncoder

YOLOF/playground/detection/coco/yolof/yolof_base/encoder.py

Lines 64 to 78 in 6189487

    
    def _init_weight(self):
        weight_init.c2_xavier_fill(self.lateral_conv)
        weight_init.c2_xavier_fill(self.fpn_conv)
        for m in [self.lateral_norm, self.fpn_norm]:
            nn.init.constant_(m.weight, 1)
            nn.init.constant_(m.bias, 0)
        for m in self.dilated_encoder_blocks.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.normal_(m.weight, mean=0, std=0.01)
                if hasattr(m, 'bias') and m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            if isinstance(m, (nn.GroupNorm, nn.BatchNorm2d, nn.SyncBatchNorm)):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)

Why is the weight initialization method for lateral_conv and fpn_conv different from the one used for the conv layers in dilated_encoder_blocks?

unable to test

I noticed that there is already a cvpods directory in the project. Do I need to download cvpods again?

Do I need to delete the existing cvpods if I download it again?

How do I install cvpods if I don't need to download it? Running pip install -e cvpods --user directly does not work because there is no setup.py.

Compatibility with anchor-free methods

Hi,
Great and very interesting work!
I wonder if you've experimented with anchor-free methods and YOLOF.
In ATSS they show a retinanet-style network that is anchor-free (only 1 anchor).

Could this be achieved with YOLOF?

Could you explain figure 4 in your paper?

Thank you for your great work. I still don't understand Figure 4 in your paper. What do the green dots stand for? What do 3 consecutive dots, 2 consecutive dots, and 1 dot mean?
If possible, please describe Figure 4.
Thank you so much!

How to get FLOPS and params of YOLOF?

I can't find the relevant code in YOLOF or cvpods.

Can't attain test result:" AssertionError: Box regression deltas become infinite or NaN!"

Hi, we tried the YOLOF code following your README, and training starts correctly. However, after a few hundred (400~500) iterations, YOLOF reports: "AssertionError: Box regression deltas become infinite or NaN!"
We tried reducing the learning rate, defined as "cfg.SOLVER.OPTIMIZER.BASE_LR", from the original value to 0.005, but it doesn't help.
Could you please give us some advice?
Thanks a lot!
Sanyaoshanren

How to understand the calculation process of normalized_cls_score?

I do not understand this code. Can anyone help me understand it?
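One way to read the expression, assuming it is the one quoted in the NaN issue further down this page (cls_score + objectness - log(1 + exp(cls_score) + exp(objectness))): algebraically, its sigmoid equals sigmoid(cls_score) * sigmoid(objectness), so the result can be interpreted as the logit of a joint class-and-objectness probability. The sketch below checks this identity with plain Python floats instead of tensors; the helper names are illustrative and not taken from the repository.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function for a single float."""
    return 1.0 / (1.0 + math.exp(-x))


def normalized_cls_score(cls_score: float, objectness: float) -> float:
    """Scalar version of the quoted expression (no clamping, small inputs only)."""
    return cls_score + objectness - math.log(
        1.0 + math.exp(cls_score) + math.exp(objectness))


cls_score, objectness = 1.5, -0.7          # arbitrary toy logits
joint_logit = normalized_cls_score(cls_score, objectness)

# sigmoid(joint_logit) matches the product of the two individual sigmoids.
product = sigmoid(cls_score) * sigmoid(objectness)
assert math.isclose(sigmoid(joint_logit), product)
```

Under this reading, a single sigmoid applied to the normalized score yields a probability that is high only when both the class logit and the objectness logit are high.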

Training on custom dataset

How do I train on a custom dataset?

No module named 'config'

Hi! I have a question.
When I try to run train_net.py, the line 'from config import config' fails with the error "No module named 'config'". I tried 'pip install config', but the error persists. I have searched for solutions, but none of them works. Can you help me?

The pretrained model of CSPDarkNet-53 is not available, please help!

https://1drv.ms/u/s!AgM0VtBH3kV9imGxZX3n_TMQGtbP?e=YMgpGJ
is not an available link, and neither are the links to the models.

request for yolof-dc5 model

Can I get a model config and pre-trained weights of the model denoted in this table?

cannot import name '_C' from 'cvpods'

Hello, I'm debugging train_net.py under the tools folder. In the cvpods folder, line 11 of the deform_conv script,
from cvpods import _C
reports an error:
cannot import name '_C' from 'cvpods'

Why is this? Do I need to install any plug-ins?
Thanks！

RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

Hello, when I run with only one GPU, I get the error "RuntimeError: Default process group has not been initialized, please make sure to call init_process_group."
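A possible workaround, not confirmed by the maintainers: the loss code quoted elsewhere on this page calls dist.all_reduce and dist.get_world_size unconditionally, and both raise this error when no process group exists. The sketch below guards those calls with is_available() and is_initialized(), which are real torch.distributed functions. To keep the example runnable without PyTorch, SingleProcessDist is a hypothetical stand-in for torch.distributed, and average_num_foreground is an illustrative helper name.

```python
class SingleProcessDist:
    """Hypothetical stand-in for torch.distributed when no process group exists."""

    @staticmethod
    def is_available() -> bool:
        return True

    @staticmethod
    def is_initialized() -> bool:
        return False

    @staticmethod
    def all_reduce(value):
        raise RuntimeError("Default process group has not been initialized")

    @staticmethod
    def get_world_size() -> int:
        raise RuntimeError("Default process group has not been initialized")


def average_num_foreground(num_foreground, dist):
    """Average the foreground count over workers; pass it through on one process."""
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(num_foreground)      # in-place sum across workers
        return num_foreground * 1.0 / dist.get_world_size()
    return num_foreground * 1.0              # single process: nothing to reduce


# With the stand-in, the guarded version returns instead of raising.
result = average_num_foreground(12, SingleProcessDist)
```

In real code the dist argument would be torch.distributed itself and num_foreground would be a tensor; whether this is the only place that needs a guard in a given checkout is an assumption.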

Cannot install!

YOLOF can be implemented with plain PyTorch. Why make such a simple project so complex? And I cannot install it.

Question about ignore threshold?

The paper points out the importance of balanced matching in label assignment, but the code actually ignores any anchor (pred) whose IoU > NEG_IGNORE_THRESHOLD or < POS_IGNORE_THRESHOLD, which may break the balance?

Also, ignoring an anchor (pred) whose IoU > NEG_IGNORE_THRESHOLD seems to make sense only with max-IoU label assignment. Why do you use the anchor (pred) whose IoU is maximum or between NEG_IGNORE_THRESHOLD and NEG_IGNORE_THRESHOLD?

YOLOF/playground/detection/coco/yolof/yolof_base/yolof.py

Lines 186 to 231 in 8628813

    
    ignore_idx = ious > self.neg_ignore_thresh
    pos_ious = torch.cat(pos_ious)
    pos_ignore_idx = pos_ious < self.pos_ignore_thresh

    src_idx = torch.cat(
        [src + idx * anchors[0].tensor.shape[0] for idx, (src, _) in
         enumerate(indices)])
    gt_classes = torch.full(pred_class_logits.shape[:1],
                            self.num_classes,
                            dtype=torch.int64,
                            device=pred_class_logits.device)
    gt_classes[ignore_idx] = -1
    target_classes_o = torch.cat(
        [t.gt_classes[J] for t, (_, J) in zip(gt_instances, indices)])
    target_classes_o[pos_ignore_idx] = -1
    gt_classes[src_idx] = target_classes_o

    valid_idxs = gt_classes >= 0
    foreground_idxs = (gt_classes >= 0) & (gt_classes != self.num_classes)
    num_foreground = foreground_idxs.sum()

    gt_classes_target = torch.zeros_like(pred_class_logits)
    gt_classes_target[foreground_idxs, gt_classes[foreground_idxs]] = 1

    dist.all_reduce(num_foreground)
    num_foreground = num_foreground * 1.0 / dist.get_world_size()

    # cls loss
    loss_cls = sigmoid_focal_loss_jit(
        pred_class_logits[valid_idxs],
        gt_classes_target[valid_idxs],
        alpha=self.focal_loss_alpha,
        gamma=self.focal_loss_gamma,
        reduction="sum",
    )
    # reg loss
    target_boxes = torch.cat(
        [t.gt_boxes.tensor[i] for t, (_, i) in zip(gt_instances, indices)],
        dim=0)
    target_boxes = target_boxes[~pos_ignore_idx]
    matched_predicted_boxes = predicted_boxes.reshape(-1, 4)[
        src_idx[~pos_ignore_idx]]
    loss_box_reg = (1 - torch.diag(generalized_box_iou(
        matched_predicted_boxes, target_boxes))).sum()

    return {
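To make the labelling order in the quoted snippet easier to follow, here is a toy trace in plain Python lists. It mirrors only what the snippet shows: every prediction starts as background (index num_classes), predictions whose best IoU exceeds the negative-ignore threshold become -1, and matched predictions are then overwritten with their ground-truth class, or with -1 when the IoU of the matched pair is below the positive-ignore threshold. The threshold values, class count, and the assign_labels helper are illustrative assumptions, not values from the repository's config.

```python
NUM_CLASSES = 3            # toy value; index NUM_CLASSES means "background"
NEG_IGNORE_THRESH = 0.7    # toy thresholds, not taken from any config
POS_IGNORE_THRESH = 0.15


def assign_labels(max_ious, matched, num_classes=NUM_CLASSES):
    """max_ious[i]: best IoU of prediction i with any ground-truth box.
    matched: {prediction index: (gt class, IoU of the matched pair)}."""
    gt_classes = [num_classes] * len(max_ious)          # start as background
    for i, iou in enumerate(max_ious):
        if iou > NEG_IGNORE_THRESH:                     # too good to be a negative
            gt_classes[i] = -1
    for i, (cls, pos_iou) in matched.items():           # matches overwrite the above
        gt_classes[i] = -1 if pos_iou < POS_IGNORE_THRESH else cls
    return gt_classes


max_ious = [0.1, 0.8, 0.9, 0.05, 0.3]
matched = {2: (1, 0.6), 3: (0, 0.1)}
gt_classes = assign_labels(max_ious, matched)           # [3, -1, 1, -1, 3]

valid = [c >= 0 for c in gt_classes]                    # used by the cls loss
foreground = [0 <= c != NUM_CLASSES for c in gt_classes]
```

In this trace, prediction 1 is dropped from the classification loss as an ignored negative, prediction 3 is dropped as an ignored positive, and only prediction 2 counts as foreground.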

RuntimeError: Error compiling objects for extension

I am using PyTorch 1.11 with CUDA 11.3, but this error occurs. Which versions are you using?

Is there a test procedure? After training, how do I test a picture with a Python program?

Please remove cvpods and keep only the experiment folders.

Bundling a separate copy of cvpods causes name conflicts if a user has already installed cvpods.
So please remove all cvpods-related files and keep only your experiments.

For example,

AssertionError: Box regression deltas become infinite or NaN!

I have only one GPU (8 GB).
Here is my log.txt:
log.txt
How do I set the learning rate, warm-up iterations, and so on?

KeyError: "No object named 'JitterCrop' found in 'transforms' registry!"

Env:

OS version: CentOS 7
Environment version: Python 3.6.13 / PyTorch 1.8.0 / CUDA 11.0 / cuDNN version
cvpods version: commit id: 82374a1fe5c83e16bdd4d07adae2545d51138ef8

I ran the following command:

cd playground/detection/coco/yolof/yolof.cspdarknet53.DC5.3x
pods_train --num-gpus 8

Actual results:


Traceback (most recent call last):
  File "/ssd/serser/cvpods/tools/train_net.py", line 109, in <module>
    args=(args,),
  File "/ssd/serser/cvpods/cvpods/engine/launch.py", line 53, in launch
    daemon=False,
  File "/ssd/serser/venv/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/ssd/serser/venv/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/ssd/serser/venv/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/ssd/serser/venv/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/ssd/serser/cvpods/cvpods/engine/launch.py", line 88, in _distributed_worker
    main_func(*args)
  File "/ssd/serser/cvpods/tools/train_net.py", line 74, in main
    runner = runner_decrator(RUNNERS.get(cfg.TRAINER.NAME))(cfg, build_model)
  File "/ssd/serser/cvpods/cvpods/engine/runner.py", line 86, in __init__
    self.data_loader = self.build_train_loader(cfg)
  File "/ssd/serser/cvpods/cvpods/engine/runner.py", line 307, in build_train_loader
    return build_train_loader(cfg)
  File "/ssd/serser/cvpods/cvpods/data/build.py", line 130, in build_train_loader
    transform_gens = build_transform_gens(cfg.INPUT.AUG.TRAIN_PIPELINES)
  File "/ssd/serser/cvpods/cvpods/data/build.py", line 69, in build_transform_gens
    return build(pipelines)
  File "/ssd/serser/cvpods/cvpods/data/build.py", line 58, in build
    tfm = TRANSFORMS.get(aug)(**args)
  File "/ssd/serser/cvpods/cvpods/utils/registry.py", line 66, in get
    "No object named '{}' found in '{}' registry!".format(name, self._name)
KeyError: "No object named 'JitterCrop' found in 'transforms' registry!"

How do you implement the single in multiple out experiments?

Hi. In your paper, Fig 1 shows SiMo style can get competitively good results, but I am wondering about the detailed architecture during experiments. How do you generate multiple outputs with a single input? Do you pick up outputs from the deeper layers to make predictions like how SSD has done?

PASCAL VOC results

Has anyone trained this code on PASCAL VOC? I implemented a TensorFlow version, but it performed slightly worse than RetinaNet and FCOS. Here is my code: https://github.com/JiXuKong/YOLOF

About NAN during training

@chensnathan

I think the following passage may be the main cause of the NaN problem:

    normalized_cls_score = cls_score + objectness - torch.log(
        1. + torch.clamp(cls_score.exp(), max=self.INF) + torch.clamp(
            objectness.exp(), max=self.INF))

The code applies exp and then clamps the result to avoid an explosion, but a hidden danger remains: exp may already have overflowed before the clamp is applied, in which case the clamp no longer helps.

So I changed it to clamp first and then apply exp:

    normalized_cls_pred = cls_pred + obj_pred - torch.log(
        1. +
        torch.clamp(cls_pred, max=DEFAULT_EXP_CLAMP).exp() +
        torch.clamp(obj_pred, max=DEFAULT_EXP_CLAMP).exp())

where DEFAULT_EXP_CLAMP = log(INF).

After this modification, the NaN problem no longer occurred.
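The clamp-before-exp idea can be illustrated with plain Python floats. This is only a sketch under stated assumptions: INF = 1e8 is an assumed value for self.INF, safe_exp is a hypothetical helper, and Python's math.exp raises OverflowError on overflow, whereas torch.exp returns inf instead. The point it shows is the same: once the input is clamped to log(INF), exp can never exceed INF.

```python
import math

INF = 1e8                          # assumed value of self.INF
DEFAULT_EXP_CLAMP = math.log(INF)  # largest input whose exp stays at or below INF


def safe_exp(x: float) -> float:
    """Clamp the input first, then exponentiate (hypothetical helper)."""
    return math.exp(min(x, DEFAULT_EXP_CLAMP))


def normalized_cls_pred(cls_pred: float, obj_pred: float) -> float:
    """Scalar version of the clamp-then-exp expression from the issue."""
    return cls_pred + obj_pred - math.log(
        1.0 + safe_exp(cls_pred) + safe_exp(obj_pred))


# Exponentiating a huge logit directly overflows in plain Python ...
try:
    math.exp(1000.0)
    overflowed = False
except OverflowError:
    overflowed = True

# ... while the clamped version stays finite for the same input.
score = normalized_cls_pred(1000.0, 2.0)
assert overflowed and math.isfinite(score)
```

For logits below DEFAULT_EXP_CLAMP the clamp is inactive, so the result is identical to the unclamped formula.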

paper related problem

Hi! I really appreciate the authors for doing such an inspiring job and sharing the code.

However, I am confused by Sec. 4.1 of the paper. In my understanding, compared with MiMo, SiSo using only C5 may have problems detecting small objects (high-level features with low resolution are better suited to large objects). To bridge the gap between SiSo and MiMo, why should we focus on larger objects? Fig. 4 discusses receptive fields. However, the receptive field of low-level features is smaller than that of C5, so if SiSo has a receptive field problem, MiMo should have one too!

So I don't really understand this part. I would be glad if you could point out where I am wrong.

Thanks in advance.

How to modify the identifier of GPU and the number of GPU to train the model?

Hello, I want to use the train_net script under the tools folder to train the yolof-res101-dc5-1x version of the network. Because the first card of my group's server is occupied by others, I want to train on other cards, but I did not find a statement for changing the GPU number in the setup script. I changed the num_gpu, num_machines and machines_rank parameters all to 1, but training still runs on GPU: 0. How can I solve this?

Thanks !
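A general CUDA mechanism that may help here, independent of this repository: the CUDA_VISIBLE_DEVICES environment variable restricts which physical GPUs a process can see, and the visible devices are renumbered from 0 inside the process. A log line showing GPU 0 can therefore still refer to, for example, physical card 1. The sketch below only builds the command and environment; build_train_command is a hypothetical helper, and pods_train --num-gpus is the launcher invocation quoted in other issues on this page.

```python
import os


def build_train_command(gpu_ids):
    """Return (command, env) restricting training to the given physical GPUs."""
    env = dict(os.environ)
    # Only these cards are visible; inside the process they appear as 0, 1, ...
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    command = ["pods_train", "--num-gpus", str(len(gpu_ids))]
    return command, env


command, env = build_train_command([1, 2])
# To launch from the experiment directory:
#   import subprocess; subprocess.run(command, env=env, check=True)
```

The variable has to be set before the training process initializes CUDA, which is why it is passed through the environment rather than changed inside the training script.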

How to use pretraining weights for training?

How to use pretraining weights for training? After downloading the pretraining weight, which folder should I put it in? How to call the pretraining weight file?

Thanks！

Should 'RetNext' in Table 1 of the paper be 'ResNext'?

"No object named 'RandomShift' found in 'transforms' registry!"

problem

Hi, author, I have the following problems in running your model

Traceback (most recent call last):
  File "/DATA/xiexu/yolo/cvpods/tools/train_net.py", line 109, in <module>
    args=(args,),
  File "/home/xiexu/.local/lib/python3.7/site-packages/cvpods/engine/launch.py", line 56, in launch
    main_func(*args)
  File "/DATA/xiexu/yolo/cvpods/tools/train_net.py", line 74, in main
    runner = runner_decrator(RUNNERS.get(cfg.TRAINER.NAME))(cfg, build_model)
  File "/home/xiexu/.local/lib/python3.7/site-packages/cvpods/engine/runner.py", line 86, in __init__
    self.data_loader = self.build_train_loader(cfg)
  File "/home/xiexu/.local/lib/python3.7/site-packages/cvpods/engine/runner.py", line 307, in build_train_loader
    return build_train_loader(cfg)
  File "/home/xiexu/.local/lib/python3.7/site-packages/cvpods/data/build.py", line 130, in build_train_loader
    transform_gens = build_transform_gens(cfg.INPUT.AUG.TRAIN_PIPELINES)
  File "/home/xiexu/.local/lib/python3.7/site-packages/cvpods/data/build.py", line 69, in build_transform_gens
    return build(pipelines)
  File "/home/xiexu/.local/lib/python3.7/site-packages/cvpods/data/build.py", line 58, in build
    tfm = TRANSFORMS.get(aug)(**args)
  File "/home/xiexu/.local/lib/python3.7/site-packages/cvpods/utils/registry.py", line 66, in get
    "No object named '{}' found in '{}' registry!".format(name, self._name)
KeyError: "No object named 'RandomShift' found in 'transforms' registry!"

bash: pods_train: command not found

SiMo

Hello, how do I train the SiMo model?

How to convert your model to ONNX?

Which file creates the model, so that I can use torch.onnx._export to convert the .pth file to ONNX?

KeyError: "No object named 'RandomShift' found in 'transforms' registry!"

[04/14 13:57:07 c2.utils.env.env]: Using a generated random seed 7889427
Traceback (most recent call last):
  File "/media/ubun/CC7251E47251D3B4/yoloseries/YOLOF-main/tools/train_net.py", line 109, in <module>
    args=(args,),
  File "/home/ubun/cvpods/cvpods/engine/launch.py", line 53, in launch
    daemon=False,
  File "/home/ubun/anaconda3/envs/detectron2/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/ubun/anaconda3/envs/detectron2/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/ubun/anaconda3/envs/detectron2/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/ubun/anaconda3/envs/detectron2/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/ubun/cvpods/cvpods/engine/launch.py", line 88, in _distributed_worker
    main_func(*args)
  File "/media/ubun/CC7251E47251D3B4/yoloseries/YOLOF-main/tools/train_net.py", line 74, in main
    runner = runner_decrator(RUNNERS.get(cfg.TRAINER.NAME))(cfg, build_model)
  File "/home/ubun/cvpods/cvpods/engine/runner.py", line 86, in __init__
    self.data_loader = self.build_train_loader(cfg)
  File "/home/ubun/cvpods/cvpods/engine/runner.py", line 307, in build_train_loader
    return build_train_loader(cfg)
  File "/home/ubun/cvpods/cvpods/data/build.py", line 130, in build_train_loader
    transform_gens = build_transform_gens(cfg.INPUT.AUG.TRAIN_PIPELINES)
  File "/home/ubun/cvpods/cvpods/data/build.py", line 69, in build_transform_gens
    return build(pipelines)
  File "/home/ubun/cvpods/cvpods/data/build.py", line 58, in build
    tfm = TRANSFORMS.get(aug)(**args)
  File "/home/ubun/cvpods/cvpods/utils/registry.py", line 66, in get
    "No object named '{}' found in '{}' registry!".format(name, self._name)
KeyError: "No object named 'RandomShift' found in 'transforms' registry!"

python3: can't open file '/tmp/pip-req-build-6jhijoew/tools/train_net.py

I followed the instructions and installed the dependencies, then ran
cd /YOLOF/playground/detection/coco/yolof/yolof.res50.C5.1x
But when I enter
pods_train --num-gpus 2
I get this error:
python3: can't open file '/tmp/pip-req-build-6jhijoew/tools/train_net.py': [Errno 2] No such file or directory
What is happening?

Save the file

[09/30 05:59:20 c2.utils.dump.events]: eta: 0:00:21 iter: 89940/90000 total_loss: 0.481 loss_cls: 0.193 loss_box_reg: 0.285 time: 0.3588 data_time: 0.0090 lr: 0.000300 max_mem: 5200M
[09/30 05:59:28 c2.utils.dump.events]: eta: 0:00:14 iter: 89960/90000 total_loss: 0.482 loss_cls: 0.206 loss_box_reg: 0.293 time: 0.3588 data_time: 0.0070 lr: 0.000300 max_mem: 5200M
[09/30 05:59:35 c2.utils.dump.events]: eta: 0:00:07 iter: 89980/90000 total_loss: 0.507 loss_cls: 0.205 loss_box_reg: 0.297 time: 0.3588 data_time: 0.0104 lr: 0.000300 max_mem: 5200M
[09/30 05:59:43 c2.checkpoint.checkpoint]: Saving checkpoint to ./output/model_final.pth
[09/30 05:59:43 c2.data.build]: TransformGens used: [ResizeShortestEdge(short_edge_length=(800, 800), max_size=1333, sample_style='choice')] in testing
[09/30 05:59:44 c2.data.datasets.coco]: Loaded 5000 images in COCO format from /home/fsr/code/YOLOF/datasets/coco/annotations/instances_val2017.json
[09/30 05:59:44 c2.evaluation.evaluator]: Start inference on 2500 data samples
Traceback (most recent call last):
  File "/home/fsr/code/YOLOF/tools/train_net.py", line 109, in <module>
    args=(args,),
  File "/home/fsr/code/YOLOF/cvpods/engine/launch.py", line 53, in launch
    daemon=False,
  File "/home/laocheng/anaconda3/envs/dh/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/laocheng/anaconda3/envs/dh/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/laocheng/anaconda3/envs/dh/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 136, in join
    signal_name=name
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGKILL
Traceback (most recent call last):
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "<string>", line 1, in <module>
  File "/home/laocheng/anaconda3/envs/dh/lib/python3.7/multiprocessing/spawn.py", line 105, in spawn_main
  File "/home/laocheng/anaconda3/envs/dh/lib/python3.7/multiprocessing/spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
    exitcode = _main(fd)
  File "/home/laocheng/anaconda3/envs/dh/lib/python3.7/multiprocessing/spawn.py", line 115, in _main
  File "/home/laocheng/anaconda3/envs/dh/lib/python3.7/multiprocessing/spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
    self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
_pickle.UnpicklingError: pickle data was truncated

How can I solve this problem? It occurs when testing starts after the checkpoint file has been saved.

receptive field

Hello, Figure 4 in your paper shows the scales of the receptive field. How did you analyse the receptive field? Can you release the code or tool? Thanks.
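While waiting for an answer, the theoretical receptive field of a stack of convolutions can be computed with the standard recurrence: each layer adds (kernel_size - 1) * dilation * jump, where jump is the product of the strides of all earlier layers. The sketch below is not the authors' tool and is not claimed to reproduce Figure 4; the four 3x3 layers with dilations 2, 4, 6 and 8 are illustrative values, and residual connections (which mix several receptive-field sizes) are ignored.

```python
def receptive_field(layers):
    """Theoretical receptive field of stacked convolutions.

    layers: iterable of (kernel_size, stride, dilation), input to output order.
    The result is measured in pixels of the stack's input feature map.
    """
    rf, jump = 1, 1
    for kernel_size, stride, dilation in layers:
        rf += (kernel_size - 1) * dilation * jump   # growth contributed by this layer
        jump *= stride                              # spacing between output positions
    return rf


# Four 3x3, stride-1 convolutions with increasing dilation (illustrative values).
dilated_stack = [(3, 1, d) for d in (2, 4, 6, 8)]
print(receptive_field(dilated_stack))               # 41

# The same four layers without dilation cover far less.
plain_stack = [(3, 1, 1)] * 4
print(receptive_field(plain_stack))                 # 9
```

Multiplying the result by the stride of the backbone feature map gives a rough extent in image pixels.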

question for uniform matcher

According to your code, the uniform matcher seems to calculate the L1 distance between pred_bbox/anchor and the targets across all images in the batch, but I think it should be computed within a single image. Another question: I do not understand how the anchor indices and the pred_box indices are fused. Why simply add the two indices?
https://github.com/megvii-model/YOLOF/blob/61a8accf957dceef11ea8029f121922b5f60901e/playground/detection/coco/yolof/yolof_base/uniform_matcher.py#L77

label assignment problem

Thanks for the authors' contributions. I have a question about uniform matching. For example, when two objects are close to each other, the anchors around them will be assigned to both objects at the same time. How is this problem solved?

/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: lambda [](int)->auto::operator()(int)->auto: block: [0,0,0], thread: [121,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.

Hello, when I train the model on 4 GPUs I get the following error; when I train on 1 GPU the error disappears:

[04/08 14:39:50 c2.utils.dump.events]: eta: 4:24:06 iter: 6960/22500 total_loss: 0.748 loss_cls: 0.333 loss_box_reg: 0.420 time: 1.0219 data_time: 0.6507 lr: 0.010000 max_mem: 5233M
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: lambda [](int)->auto::operator()(int)->auto: block: [0,0,0], thread: [121,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
ERROR [04/08 14:39:54 c2.engine.base_runner]: Exception during training:
Traceback (most recent call last):
  File "/media/6855ca5f-2432-4ace-ab31-3877011231fc/CODE_detection/YOLOF/cvpods/engine/base_runner.py", line 84, in train
    self.run_step()
  File "/media/6855ca5f-2432-4ace-ab31-3877011231fc/CODE_detection/YOLOF/cvpods/engine/base_runner.py", line 185, in run_step
    loss_dict = self.model(data)
  File "/home/env/python3.6env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/env/python3.6env/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 447, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/env/python3.6env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "../yolof_base/yolof.py", line 135, in forward
    pred_logits, pred_anchor_deltas)
  File "../yolof_base/yolof.py", line 216, in losses
    pred_class_logits[valid_idxs],
RuntimeError: copy_if failed to synchronize: device-side assert triggered
[04/08 14:39:54 c2.engine.hooks]: Overall training speed: 6961 iterations in 1:58:34 (1.0221 s / it)
[04/08 14:39:54 c2.engine.hooks]: Total training time: 2:04:42 (0:06:08 on hooks)
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:764)

Train on our own datasets

[code implementation] losses in yolof.py

Hello, thank you for your work. While debugging the YOLOF code I found some things I do not understand, and I hope to get your help.

Object detection w/ video

Is there a simple bash command for object detection based on video ?

What's the difference between YOLOF_CSP_D_53_DC5_9x_stage2_3x and YOLOF_CSP_D_53_DC5_9x?

Thanks for sharing your research!

I have one question.

What's the difference between YOLOF_CSP_D_53_DC5_9x_stage2_3x and YOLOF_CSP_D_53_DC5_9x?

How can I run this project on a single GPU?

I deleted IMS_PER_DEVICE, changed IMS_PER_BATCH from 64 to 16, and used pods_train --num_gpu 1 to train. Then "Default process group is not initialized" is reported. It seems that YOLOF uses SyncBN, but I didn't find it in config.py. Thanks.

about accuracy

I use res50-C5
when I don't use the pre-training model, the accuracy is only 59%. After using the pre-training model, the accuracy reaches 80%.
However, my experiment on YOLOv5s is only 75%.
Should I use the pre-training model? What may be the reason for my situation?

TypeError: __init__() missing 2 required positional arguments: 'cfg' and 'distributed'

COCOEvaluator(dataset_name, output_dir=output_folder))
TypeError: __init__() missing 2 required positional arguments: 'cfg' and 'distributed'

Training runs, but this problem appears when the validation set is fed in.

YOLOF/cvpods/_C.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _Z39tree_filter_refine_backward_edge_weightRKN2at6TensorES2_S2_S2_S2_S2_S2_S2_S2_

When I run
cd playground/detection/coco/yolof/yolof.res50.C5.1x
pods_train --num-gpus 8
I get an error:
YOLOF/cvpods/_C.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _Z39tree_filter_refine_backward_edge_weightRKN2at6TensorES2_S2_S2_S2_S2_S2_S2_S2_
How should I solve it?

Cannot get the AP result in the paper

Hello! I'm trying to re-train YOLOF on a single 3090 GPU with a batch size of 32. However, even though I tried modifying some parameters such as base-lr and steps, the best AP I can get is 35.7688, which is much worse than the AP of 37.7 in your paper. Do you have any suggestions on how to modify the base learning rate, steps and warmup settings?
Thank you!
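One commonly used heuristic for this situation, not confirmed by the authors for YOLOF, is the linear scaling rule: when the total batch size is multiplied by a factor, multiply the base learning rate by the same factor and divide the iteration counts by it, so the number of training epochs stays the same. The sketch below applies that rule; rescale_schedule is a hypothetical helper, and the reference values (learning rate, steps, max_iter for batch size 64) are placeholders to be replaced with the values from the experiment's own config.

```python
def rescale_schedule(ref_lr, ref_batch, ref_steps, ref_max_iter, batch):
    """Linear scaling rule: lr scales with batch size, iterations scale inversely."""
    scale = batch / ref_batch
    return {
        "base_lr": ref_lr * scale,
        "steps": tuple(round(s / scale) for s in ref_steps),
        "max_iter": round(ref_max_iter / scale),
    }


# Placeholder reference schedule for a total batch size of 64 (not from the repo).
schedule = rescale_schedule(
    ref_lr=0.12, ref_batch=64, ref_steps=(15000, 20000), ref_max_iter=22500,
    batch=32,
)
print(schedule)
```

Warmup length is often stretched by the same factor as the other iteration counts; whether that closes the reported AP gap is not something this sketch can promise.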

How can I run a demo to evaluate the result?

I have trained the model, and I want to run it on some video examples to evaluate the result. How should I modify the code?
