
detr's Introduction

DE⫶TR: End-to-End Object Detection with Transformers

Support Ukraine

PyTorch training code and pretrained models for DETR (DEtection TRansformer). We replace the full complex hand-crafted object detection pipeline with a Transformer, and match Faster R-CNN with a ResNet-50, obtaining 42 AP on COCO using half the computation power (FLOPs) and the same number of parameters. Inference in 50 lines of PyTorch.


What it is. Unlike traditional computer vision techniques, DETR approaches object detection as a direct set prediction problem. It consists of a set-based global loss, which forces unique predictions via bipartite matching, and a Transformer encoder-decoder architecture. Given a fixed small set of learned object queries, DETR reasons about the relations of the objects and the global image context to directly output the final set of predictions in parallel. Due to this parallel nature, DETR is very fast and efficient.

About the code. We believe that object detection should not be more difficult than classification, and should not require complex libraries for training and inference. DETR is very simple to implement and experiment with, and we provide a standalone Colab Notebook showing how to do inference with DETR in only a few lines of PyTorch code. Training code follows this idea - it is not a library, but simply a main.py importing model and criterion definitions with standard training loops.

Additionally, we provide a Detectron2 wrapper in the d2/ folder. See the readme there for more information.

For details see End-to-End Object Detection with Transformers by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko.

See our blog post to learn more about end to end object detection with transformers.

Model Zoo

We provide baseline DETR and DETR-DC5 models, and plan to include more in the future. AP is computed on COCO 2017 val5k, and inference time is measured over the first 100 val5k COCO images, with the torchscript transformer.

   name      backbone  schedule  inf_time  box AP  url           size
0  DETR      R50       500       0.036     42.0    model | logs  159Mb
1  DETR-DC5  R50       500       0.083     43.3    model | logs  159Mb
2  DETR      R101      500       0.050     43.5    model | logs  232Mb
3  DETR-DC5  R101      500       0.097     44.9    model | logs  232Mb

COCO val5k evaluation results can be found in this gist.

The models are also available via torch hub; to load DETR R50 with pretrained weights, simply do:

model = torch.hub.load('facebookresearch/detr:main', 'detr_resnet50', pretrained=True)
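For reference, here is a minimal inference sketch in the spirit of the demo notebook. The local file name image.jpg and the 0.7 confidence threshold are illustrative choices; pred_logits and pred_boxes are the model's standard output keys.

import torch
import torchvision.transforms as T
from PIL import Image

model = torch.hub.load('facebookresearch/detr:main', 'detr_resnet50', pretrained=True)
model.eval()

# standard ImageNet normalization, as used in the demo
transform = T.Compose([
    T.Resize(800),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

img = Image.open('image.jpg').convert('RGB')   # hypothetical input image
inputs = transform(img).unsqueeze(0)           # batch of size 1

with torch.no_grad():
    outputs = model(inputs)

# keep predictions whose best non-"no object" class score exceeds 0.7
probas = outputs['pred_logits'].softmax(-1)[0, :, :-1]
keep = probas.max(-1).values > 0.7
print(probas[keep].argmax(-1))        # predicted class indices
print(outputs['pred_boxes'][0, keep]) # normalized (cx, cy, w, h) boxes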

COCO panoptic val5k models:

   name      backbone  box AP  segm AP  PQ    url       size
0  DETR      R50       38.8    31.1     43.4  download  165Mb
1  DETR-DC5  R50       40.2    31.9     44.6  download  165Mb
2  DETR      R101      40.1    33.0     45.1  download  237Mb

Check out our panoptic colab to see how to use and visualize DETR's panoptic segmentation predictions.

Notebooks

We provide a few notebooks in colab to help you get a grasp on DETR:

  • DETR's hands-on Colab Notebook: Shows how to load a model from hub, generate predictions, then visualize the attention of the model (similar to the figures of the paper)
  • Standalone Colab Notebook: In this notebook, we demonstrate how to implement a simplified version of DETR from the ground up in 50 lines of Python, then visualize the predictions. It is a good starting point if you want to gain a better understanding of the architecture and poke around before diving into the codebase.
  • Panoptic Colab Notebook: Demonstrates how to use DETR for panoptic segmentation and plot the predictions.

Usage - Object detection

There are no extra compiled components in DETR and package dependencies are minimal, so the code is very simple to use. We provide instructions on how to install dependencies via conda. First, clone the repository locally:

git clone https://github.com/facebookresearch/detr.git

Then, install PyTorch 1.5+ and torchvision 0.6+:

conda install -c pytorch pytorch torchvision

Install pycocotools (for evaluation on COCO) and scipy (for training):

conda install cython scipy
pip install -U 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'

That's it; you should be good to train and evaluate detection models.

(Optional) To work with panoptic segmentation, install panopticapi:

pip install git+https://github.com/cocodataset/panopticapi.git

Data preparation

Download and extract COCO 2017 train and val images with annotations from http://cocodataset.org. We expect the directory structure to be the following:

path/to/coco/
  annotations/  # annotation json files
  train2017/    # train images
  val2017/      # val images

Training

To train baseline DETR on a single node with 8 gpus for 300 epochs run:

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --coco_path /path/to/coco 

A single epoch takes 28 minutes, so the 300-epoch training takes around 6 days on a single machine with 8 V100 cards. To ease reproduction of our results, we provide results and training logs for a 150-epoch schedule (3 days on a single machine), achieving 39.5/60.3 AP/AP50.

We train DETR with AdamW, setting the learning rate to 1e-4 in the transformer and 1e-5 in the backbone. Horizontal flips, scales and crops are used for augmentation. Images are rescaled to have min size 800 and max size 1333. The transformer is trained with dropout of 0.1, and the whole model is trained with grad clip of 0.1.
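As a rough sketch of how those hyperparameters translate into code (this is an illustration of the setup described above, not the exact main.py code; the hub model is just a stand-in for any DETR model):

import torch

model = torch.hub.load('facebookresearch/detr:main', 'detr_resnet50', pretrained=False)

# two parameter groups: backbone parameters at 1e-5, everything else at 1e-4
param_dicts = [
    {"params": [p for n, p in model.named_parameters()
                if "backbone" not in n and p.requires_grad]},
    {"params": [p for n, p in model.named_parameters()
                if "backbone" in n and p.requires_grad],
     "lr": 1e-5},
]
optimizer = torch.optim.AdamW(param_dicts, lr=1e-4, weight_decay=1e-4)

# in the training loop, right before optimizer.step(), gradients are clipped:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.1)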

Evaluation

To evaluate DETR R50 on COCO val5k with a single GPU run:

python main.py --batch_size 2 --no_aux_loss --eval --resume https://dl.fbaipublicfiles.com/detr/detr-r50-e632da11.pth --coco_path /path/to/coco

We provide results for all DETR detection models in this gist. Note that numbers vary depending on batch size (number of images) per GPU. Non-DC5 models were trained with batch size 2, and DC5 with 1, so DC5 models show a significant drop in AP if evaluated with more than 1 image per GPU.

Multinode training

Distributed training is available via Slurm and submitit:

pip install submitit

Train baseline DETR-6-6 model on 4 nodes for 300 epochs:

python run_with_submitit.py --timeout 3000 --coco_path /path/to/coco

Usage - Segmentation

We show that it is relatively straightforward to extend DETR to predict segmentation masks. We mainly demonstrate strong panoptic segmentation results.

Data preparation

For panoptic segmentation, you need the panoptic annotations in addition to the COCO dataset (see above for the COCO dataset). You need to download and extract the annotations. We expect the directory structure to be the following:

path/to/coco_panoptic/
  annotations/  # annotation json files
  panoptic_train2017/    # train panoptic annotations
  panoptic_val2017/      # val panoptic annotations

Training

We recommend training segmentation in two stages: first train DETR to detect all the boxes, and then train the segmentation head. For panoptic segmentation, DETR must learn to detect boxes for both stuff and things classes. You can train it on a single node with 8 gpus for 300 epochs with:

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --coco_path /path/to/coco  --coco_panoptic_path /path/to/coco_panoptic --dataset_file coco_panoptic --output_dir /output/path/box_model

For instance segmentation, you can simply train a normal box model (or use a pre-trained one we provide).

Once you have a box model checkpoint, you need to freeze it, and train the segmentation head in isolation. For panoptic segmentation you can train on a single node with 8 gpus for 25 epochs:

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --masks --epochs 25 --lr_drop 15 --coco_path /path/to/coco  --coco_panoptic_path /path/to/coco_panoptic  --dataset_file coco_panoptic --frozen_weights /output/path/box_model/checkpoint.pth --output_dir /output/path/segm_model

For instance segmentation only, simply remove the dataset_file and coco_panoptic_path arguments from the above command line.

License

DETR is released under the Apache 2.0 license. Please see the LICENSE file for more information.

Contributing

We actively welcome your pull requests! Please see CONTRIBUTING.md and CODE_OF_CONDUCT.md for more info.

detr's People

Contributors

alcinos, alexander-kirillov, anikar, bigfootjon, bomri, danielfennhagencab, dcyoung, dmitryvinn, epiphqny, fatihbaltaci, fmassa, godricly, hafiz703, haridas, hobeom, jd730, leaderj1001, lessw2020, m3at, naelsondouglas, ppwwyyxx, shoufachen, szagoruyko, thilohuellmann, zhiqwang


detr's Issues

Unable to understand positional encoding and masks.

  1. Can someone please explain to me how you calculate the positional encoding?
    I know what positional encoding is, but models.positional_encoding.py is a bit overwhelming. I want to know what is considered the positional encoding when working with images. Is it calculated for the feature maps or for something else?

  2. How do you calculate masks when using images in transformers?
    I know what masks are, but how do we calculate these when dealing with images?

I found no answers to these questions anywhere, so I am posting them here.
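Not an official answer, but a rough sketch of the two ideas the questions touch on, under my own simplified assumptions (the repo's actual implementation differs in details such as normalization): a 2D sine/cosine encoding computed per feature-map location, and a boolean padding mask marking which pixels of a padded batch tensor are not real image content.

import torch

def sine_positional_encoding_2d(h, w, num_pos_feats=128, temperature=10000):
    """Sine/cosine encoding over the (h, w) grid of a feature map.
    Returns a (2 * num_pos_feats, h, w) tensor: y and x halves concatenated."""
    y_embed = torch.arange(1, h + 1, dtype=torch.float32).unsqueeze(1).expand(h, w)
    x_embed = torch.arange(1, w + 1, dtype=torch.float32).unsqueeze(0).expand(h, w)
    dim_t = temperature ** (2 * (torch.arange(num_pos_feats) // 2) / num_pos_feats)
    pos_x = x_embed.unsqueeze(-1) / dim_t   # (h, w, num_pos_feats)
    pos_y = y_embed.unsqueeze(-1) / dim_t
    pos_x = torch.stack((pos_x[..., 0::2].sin(), pos_x[..., 1::2].cos()), dim=-1).flatten(-2)
    pos_y = torch.stack((pos_y[..., 0::2].sin(), pos_y[..., 1::2].cos()), dim=-1).flatten(-2)
    return torch.cat((pos_y, pos_x), dim=-1).permute(2, 0, 1)

def padding_mask(batch_h, batch_w, img_h, img_w):
    """Images in a batch are padded to a common size; True marks padded pixels."""
    mask = torch.ones(batch_h, batch_w, dtype=torch.bool)
    mask[:img_h, :img_w] = False
    return mask

pos = sine_positional_encoding_2d(25, 34)   # e.g. a 32x-downsampled conv feature map
print(pos.shape)                            # torch.Size([256, 25, 34])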

Suggestion - change to ResNeST50 backbone (new split attention arch)

One idea to boost DETR's impressive results might be to swap in the new ResNeSt-50 backbone (released last month by Amazon AI and UC Davis).
In all of the architectures they tested, it immediately provided a 3-4% AP boost on COCO.

This improvement also helps downstream tasks including object detection, instance segmentation and semantic segmentation. For example, by simply replacing the ResNet-50 backbone with ResNeSt-50, we improve the mAP of Faster-RCNN on MS-COCO from 39.3% to 42.3% and the mIoU for DeeplabV3 on ADE20K from 42.1% to 45.1%.

It should plug and play right in. I've been using it for classification work and it was a nice improvement there, and the concept of better global context maps to the improvements DETR is providing for the head architecture.

https://arxiv.org/abs/2004.08955v1
https://github.com/zhanghang1989/ResNeSt

(I plan to test this out on my own datasets, but will not have time to train it on Coco proper and I think conceptually it's a great match for DETR regardless).

panoptic segmentation visualization

Very nice repo!

I want to do a simple inference with the panoptic segmentation model. How can I visualize the output of the panoptic model after the "panoptic post-processing"?

Thanks ;)

Does the model work well in transfer learning?

Thanks for the amazing work!

I noticed the training time for DETR is 3 days with multi GPUs. I believe this setting is too hard to achieve for most end users.

I would like to know: in your study, did you try transfer learning with DETR? If so, would you provide a related module for that?

Generate attention decoder heatmap images for model feedback as shown in DETR paper?

❓ How to do something using DETR

Describe what you want to do, including:

  1. what inputs you will provide, if any:
  2. what outputs you are expecting:

Will code be added/released to generate the attention decoder heatmaps like in the paper? (i.e. the zebra and elephant images).
I've found heatmaps to be very useful for helping with training and understanding model performance, so I'm hoping the code used in the paper will be added so we can generate these for our own DETR models.

NOTE:

  1. Only general answers are provided.
    If you want to ask about "why X did not work", please use the
    Unexpected behaviors issue template.

  2. About how to implement new models / new dataloader / new training logic, etc., check documentation first.

  3. We do not answer general machine learning / computer vision questions that are not specific to DETR, such as how a model works, how to improve your training/make it converge, or what algorithm/methods can be used to achieve X.

Unable to evaluate model

Environment
pytorch 1.3.1
torchvision 0.4.2

I am able to train the model successfully. However, the following error appears when I run the evaluation independently.

srun --gres gpu:1 python main.py --batch_size 2 --no_aux_loss --eval --resume https://dl.fbaipublicfiles.com/detr/detr-r50-e632da11.pth --coco_path ../../dataset/

Traceback (most recent call last):
File "main.py", line 248, in
main(args)
File "main.py", line 106, in main
utils.init_distributed_mode(args)
File "/mnt/lustre/chenyuntao1/homes/gaopeng/mask_detr/detr/util/misc.py", line 416, in init_distributed_mode
world_size=args.world_size, rank=args.rank)
File "/mnt/lustre/chenyuntao1/homes/gaopeng/anaconda3/envs/detr/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 400, in init_process_group
store, rank, world_size = next(rendezvous(url))
File "/mnt/lustre/chenyuntao1/homes/gaopeng/anaconda3/envs/detr/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 130, in _env_rendezvous_handler
raise _env_error("MASTER_ADDR")
ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable MASTER_ADDR expected, but not set

Stuck at the beginning of training

Hi, I am trying to run DETR on my local machine, but the training process gets stuck at the beginning stage, as shown below.
[screenshot]

I am using PyTorch 1.5 and torchvision 0.6, and a Faster R-CNN model can be trained on the COCO dataset without this problem.

I am wondering whether the problem may come from the DataLoader part. Could you provide some hints on this? Thanks!

Custom dataset training

❓ How to do something using DETR

I am trying to train the resnet50 model with one more class on top of the coco dataset. So I loaded the pretrained model like this -

model = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True)

and then I am unfreezing class_embed and bbox_embed

for param in model.parameters():
    param.requires_grad = False

classifier_class = nn.Sequential(nn.Linear(256,128), 
                                 nn.ReLU(), 
                                 nn.Dropout(p=0.2), 
                                 nn.Linear(128,93), 
                                 #nn.LogSoftmax(dim=1)
                                 )

model.class_embed = classifier_class

classifier_bbox = nn.Sequential(nn.Linear(256,256), 
                                nn.ReLU(), 
                                nn.Dropout(p=0.2), 
                                nn.Linear(256,256),
                                nn.ReLU(),
                                nn.Dropout(p=0.2),
                                nn.Linear(256,4),
                                nn.Sigmoid()
                                )

And I am using build_model to get my criterion and postprocessors

dummy, criterion, postprocessors = build_model(data_args)

Optimizer:

optimizer = torch.optim.Adam([{'params': model.class_embed.parameters()}, 
                             {'params': model.bbox_embed.parameters()}], 
                             lr=data_args.lr, weight_decay=data_args.weight_decay)

Now I am loading only 'skyscraper' class using data_loader.

Unfortunately I am getting this error:

RuntimeError: weight tensor should be defined either for all or no classes at /pytorch/aten/src/THCUNN/generic/SpatialClassNLLCriterion.cu:27

Here is the entire code:
https://colab.research.google.com/drive/1L3PLEiOVICgmjyK6JIDjEBFmraVEQYhz?usp=sharing

Consider CIoU or Distance-IoU instead of GIoU for the loss function?

In EfficientDet there was an improvement gain from switching to Distance-IoU, and I suspect the same would hold for DETR with either Distance or Complete IoU.

By incorporating DIoU and CIoU losses into state-of-the-art object detection algorithms, e.g., YOLO v3, SSD and Faster RCNN, we achieve notable performance gains in terms of not only IoU metric but also GIoU metric.

we consider three geometric factors, i.e., overlap area, normalized central point distance and aspect ratio, which are crucial for measuring bounding box regression in object detection and instance segmentation.
The three geometric factors are then incorporated into CIoU loss for better distinguishing difficult regression cases. The training of deep models using CIoU loss results in consistent AP and AR improvements in comparison to widely adopted ℓn-norm loss and IoU-based loss.

Here's the paper discussing CIoU:
https://arxiv.org/abs/2005.03572
and Distance IoU:
https://arxiv.org/abs/1911.08287

and most importantly code:
https://github.com/Zzh-tju/CIoU
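For anyone who wants to experiment, here is a minimal Distance-IoU loss sketch of my own (not code from this repo or from the CIoU repo), for row-wise paired boxes in (x0, y0, x1, y1) format, following the formula DIoU = IoU - d^2 / c^2 where d is the distance between box centers and c the diagonal of the smallest enclosing box:

import torch

def diou_loss(boxes1, boxes2, eps=1e-7):
    """Distance-IoU loss for (N, 4) xyxy boxes, paired row-wise."""
    # intersection and union for plain IoU
    lt = torch.max(boxes1[:, :2], boxes2[:, :2])
    rb = torch.min(boxes1[:, 2:], boxes2[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area1 = (boxes1[:, 2] - boxes1[:, 0]) * (boxes1[:, 3] - boxes1[:, 1])
    area2 = (boxes2[:, 2] - boxes2[:, 0]) * (boxes2[:, 3] - boxes2[:, 1])
    iou = inter / (area1 + area2 - inter + eps)

    # squared distance between box centers
    c1 = (boxes1[:, :2] + boxes1[:, 2:]) / 2
    c2 = (boxes2[:, :2] + boxes2[:, 2:]) / 2
    center_dist = ((c1 - c2) ** 2).sum(dim=1)

    # squared diagonal of the smallest enclosing box
    enc_lt = torch.min(boxes1[:, :2], boxes2[:, :2])
    enc_rb = torch.max(boxes1[:, 2:], boxes2[:, 2:])
    diag = ((enc_rb - enc_lt) ** 2).sum(dim=1)

    return 1 - iou + center_dist / (diag + eps)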

Run panoptic segmentation

Great paper and repo btw, congrats!

I wanted to do a simple single image inference with the panoptic segmentation model. I achieved that in a very 'hacky' way by editing the main.sh file (can be seen here).

Are you planning to release a demo notebook like the one for object detection, or to upload the .pth files to torch hub?

Need for Speed

Hi, just a question on speed. The reported inference speeds are on which GPU? Tesla V100 or something less powerful? Thanks

Help on Training a Small custom data set

Hi Team,
I am working on a custom dataset which has 7 classes and 1500 images. I want to train DETR on it; please help me out with how I should train the model.

Thanks in advance

High class error

Hi, so I've tried training with a personal dataset and with COCO 2017 as a sanity check. My class_error stays at 100.00 for most of the training, with very few errors in the 75-100 range. I average around 99 for my class error after a couple of epochs for both training and validation (both on my dataset and on COCO 2017). I wanted to know if anyone has experienced similar issues?

To add, I only changed the num_queries flag for my personal dataset. COCO 2017 kept its original arguments. My loss does seem to drop, however. Any direction would be greatly appreciated!

Landmark regression

Have you experimented with landmark/joint regression rather than just bounding boxes?
Some of the methods mentioned in the paper, like "Objects as Points", were also applied to these joint tasks.

custom training asserts with "degenerate bboxes" over and over - but bboxes look correct, any debugging insight?

I'm trying to get my custom dataset working, but I can't get past 8 or so images via get_item, and it keeps asserting that my bboxes are bad. I pull that one, it flags the next one; I pull that one, it flags the next...

From reading the code it wants to check that x1 and y1 are larger than x0 and y0 which is a great check.

55   assert (boxes1[:, 2:] >= boxes1[:, :2]).all()

But it keeps flagging images that, when I unwind from COCO format, should be fine... thus any insights? I was not able to print the boxes1 (200, 4) and boxes2 (12, 4) tensors for some reason, so I couldn't see what it was actually calculating for the results (it threw an odd GPU issue with 'formatting').

For example, it flagged this image as being bad. Here's the JSON for it in COCO format, 6 classes (1 box will surround all the other 5 objects, by the way, as it's a malaria reader, so I'm not sure if that box encompassing other boxes is really the issue):

{"id": "c33c3539-8bd1-48e0-8065-831709e5e64d", "image_id": 3091210, "category_id": 2905442, "segmentation": null, "area": 0, "bbox": **[499, 121, 177, 80]**, "iscrowd": 0}, 
{"id": "0023d71e-e1e9-4862-a0b8-6e2bc3982b3b", "image_id": 3091210, "category_id": 2905422, "segmentation": null, "area": 0, "bbox": **[492, 523, 187, 163]**, "iscrowd": 0},
 {"id": "726fdfbc-3801-409d-ab75-ccf951e74316", "image_id": 3091210, "category_id": 2905421, "segmentation": null, "area": 0, "bbox": **[496, 428, 181, 93],** "iscrowd": 0}, 
{"id": "2bf85a8e-108d-4875-b0f5-47c8e5cb13e0", "image_id": 3091210, "category_id": 2905420, "segmentation": null, "area": 0, "bbox": **[494, 272, 186, 169]**, "iscrowd": 0},
 {"id": "8669c13a-1205-4e94-a645-18e2ffa491d0", "image_id": 3091210, "category_id": 2905419, "segmentation": null, "area": 0, "bbox": **[489, 127, 193, 557]**, "iscrowd": 0},
 {"id": "d9619859-e0ef-4632-ad51-7237a5760a5e", "image_id": 3091210, "category_id": 2905418, "segmentation": null, "area": 0, "bbox": **[495, 203, 182, 73]**, "iscrowd": 0},

And as a check for me, here's coco format:
The COCO bounding box format is [top left x position, top left y position, width, height].

All the bboxes which it flags have positive numbers for width and height, so x1 and y1 must be larger than x0 and y0; only a negative number added to the original x0 or y0 could result in it being smaller... so I'm unclear what it is asserting on or for.

But it asserts here:

~/detr/util/box_ops.py in generalized_box_iou(boxes1, boxes2)
     53     #print(boxes1)
     54     #print(boxes2)
---> 55     assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
     56     assert (boxes2[:, 2:] >= boxes2[:, :2]).all()
     57     iou, union = box_iou(boxes1, boxes2)

I've removed 15+ images trying to get it to actually train, but it just keeps flagging more and more as invalid bboxes. I remove one image, then it asserts on the next one... and in reviewing the ones it flags vs the ones it lets pass, I don't see any real difference. (I have trained with this same dataset on EfficientDet, so I know the dataset is reasonable.)

Thus, any help with debugging, or insight into what might be awry, would be appreciated.
Thanks!
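Not an official diagnosis, but one possible source of such degenerate boxes is preprocessing (clamping to the image bounds or random crops during augmentation) rather than the raw annotations. A small self-contained sanity-check sketch, with hypothetical image dimensions, that converts COCO [x, y, w, h] boxes to the xyxy form the assertion expects and flags boxes that collapse after clamping:

import torch

def coco_to_xyxy(coco_boxes, img_w, img_h):
    """Convert COCO [x, y, w, h] boxes to [x0, y0, x1, y1] and clamp to the image,
    mimicking the kind of preprocessing done before the IoU assertion."""
    boxes = torch.as_tensor(coco_boxes, dtype=torch.float32).reshape(-1, 4)
    boxes[:, 2:] += boxes[:, :2]              # w, h -> x1, y1
    boxes[:, 0::2].clamp_(min=0, max=img_w)   # clamping can shrink boxes that overflow the image
    boxes[:, 1::2].clamp_(min=0, max=img_h)
    degenerate = ~(boxes[:, 2:] > boxes[:, :2]).all(dim=1)
    return boxes, degenerate

boxes, bad = coco_to_xyxy([[499, 121, 177, 80], [489, 127, 193, 557]], img_w=640, img_h=480)
print(boxes)
print(bad)   # True for boxes whose x1/y1 ended up <= x0/y0 after clamping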

Training with resnet18

❓ How to train DETR with resnet18 backbone?

Describe what you want to do, including:

  1. I'm trying to run training on my 2080ti with resnet18 backbone and getting an error
  2. I started with the default command, but still ended up unsuccessful: python -m torch.distributed.launch --nproc_per_node=1 --use_env main.py --num_queries 2000 --pre_norm --masks --output_dir output --eval --num_workers 4 --enc_layers 2 --dec_layers 2 --dim_feedforward 512 --backbone resnet18 --hidden_dim 128
  3. I'm detecting small round like objects with a simple background. There're 200-2000 objects per image.

Could you please help me run it with resnet18? Any advice regarding optimal parameters to start with for my task is appreciated!

Traceback:

  File "main.py", line 248, in <module>
    main(args)
  File "main.py", line 186, in main
    data_loader_val, base_ds, device, args.output_dir)
  File "/home/user/.virtualenvs/jupyter/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/Documents/repos/detr/engine.py", line 92, in evaluate
    outputs = model(samples)
  File "/home/user/.virtualenvs/jupyter/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/.virtualenvs/jupyter/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 445, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/user/.virtualenvs/jupyter/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/Documents/repos/detr/models/segmentation.py", line 57, in forward
    seg_masks = self.mask_head(src_proj, bbox_mask, [features[2].tensors, features[1].tensors, features[0].tensors])
  File "/home/user/.virtualenvs/jupyter/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/Documents/repos/detr/models/segmentation.py", line 110, in forward
    cur_fpn = self.adapter1(fpns[0])
  File "/home/user/.virtualenvs/jupyter/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/.virtualenvs/jupyter/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 349, in forward
    return self._conv_forward(input, self.weight)
  File "/home/user/.virtualenvs/jupyter/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 346, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Given groups=1, weight of size [64, 1024, 1, 1], expected input[2, 256, 50, 50] to have 1024 channels, but got 256 channels instead
Traceback (most recent call last):
  File "/home/user/.pyenv/versions/3.7.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/user/.pyenv/versions/3.7.7/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/user/.virtualenvs/jupyter/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/home/user/.virtualenvs/jupyter/lib/python3.7/site-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)

Recommendations for training Detr on custom dataset?

Very impressed with the innovative new architecture in DETR!
Can you clarify recommendations for training on a custom dataset?
Should we build a model similar to the demo and train it, or is it better to fine-tune a full COCO-pretrained model and adjust the linear layer to the desired class count?
Thanks in advance for any input.

Why 'num_classes=91'?

Hi, great work.
I read your code and found that you set 'num_classes=91' for COCO detection.
But COCO detection has 80 categories. Could you explain why you set this to 91?
Thanks very much~
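Not an official answer, but one way to see where 91 may come from: COCO category ids are not contiguous, and the largest id in the 2017 instance annotations is 90, so 91 = max id + 1 class slots cover all ids (with an extra "no object" slot on top). A quick check against a local annotation file (the path is illustrative):

import json

with open('path/to/coco/annotations/instances_val2017.json') as f:
    categories = json.load(f)['categories']

ids = sorted(c['id'] for c in categories)
print(len(ids), min(ids), max(ids))   # 80 categories, ids ranging from 1 to 90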

use plot_logs function from plot_utils - errors if you only have a single log.txt?

❓ How to do something using DETR

Describe what you want to do, including:

  1. what inputs you will provide, if any:
    log files from training
  2. what outputs you are expecting:
    plots of the various losses

I'm trying to make use of the plot_logs function in /util/plot_utils.py
In Jupyter, I'm passing a pathlib Path to the dir with my log.txt... but that immediately generates a TypeError: 'PosixPath' object is not iterable, which makes sense; I'm just passing in the dir of the single log.txt, so there is nothing to iterate.
I changed the code to not iterate and just read the single file into a df, and ultimately got the graphs to print... but clearly I'm not calling it correctly?
Is there a preferred way to call/plot a single log file without removing all the list comprehensions?
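In case it helps, wrapping the directory in a list appears to be what the function expects, since it iterates over experiment directories. A sketch, assuming you run it from the repo root and keep the default log file name:

from pathlib import Path
from util.plot_utils import plot_logs

# plot_logs iterates over a list of experiment directories, each containing a log.txt
log_dirs = [Path('/path/to/output_dir')]
plot_logs(log_dirs)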

NOTE:

  1. Only general answers are provided.
    If you want to ask about "why X did not work", please use the
    Unexpected behaviors issue template.

  2. About how to implement new models / new dataloader / new training logic, etc., check documentation first.

  3. We do not answer general machine learning / computer vision questions that are not specific to DETR, such as how a model works, how to improve your training/make it converge, or what algorithm/methods can be used to achieve X.

Strange behavior, potentially a bug.

I am trying to adapt this repository to an OCR task and am facing the following dilemma.

While training, you have 3 different sizes of an image encoded in the dataset:

  1. The actual tensor size
  2. The field 'size' - which means what?
  3. The field 'orig_size' - which I believe means the original size of the image in the dataset

So if you try to print the boxes of a dataset element for a batch size bigger than 1 (I checked it with 5), you will see that the same picture, due to random batch sampling, will have different box coordinates.

Look below. This function will print the boxes on the image correctly, but only in the case of batch_size=1, or if all pictures in your dataset are the same size, or if you take W, H for scaling from target["size"], which is wrong.

# img = (3, H, W) tensor from batch with the same H and W
# target - labels for this particular image
def showImageFromBatch(img, target):
    from PIL import Image, ImageDraw, ImageFont

    draw = ImageDraw.Draw(img)
    boxes = target['boxes']
    cl = target['labels']

    if 1:#boxes.max() <= 1:
        boxes = box_cxcywh_to_xyxy(boxes)

        print('Image:', (img.height, img.width), target['size'], target['orig_size'])


        H, W = target['size'] <<< Works well only with this
        W, H = img.width, img.height <<< But it should work with this!!!

        boxes[:, 0::2] *= W
        boxes[:, 1::2] *= H

    for i in range(len(boxes)):
        x1, y1, x2, y2 = boxes[i]
        draw.rectangle((x1, y1, x2, y2), outline=(0, 255, 0) if cl[i] >= 0 else (0, 0, 0), width=3)
        draw.text((x1, y1), str(cl[i].item()), (0, 255, 0) if cl[i] >= 0 else (0, 0, 0),
                  font=ImageFont.truetype("DejaVuSansMono.ttf", 20))
    img.show()

Please clarify this situation. Thank you in advance.

What is the meaning of self[0] and self[1] in the Joiner class?

class Joiner(nn.Sequential):
    def __init__(self, backbone, position_embedding):
        super().__init__(backbone, position_embedding)

    def forward(self, tensor_list):
        xs = self[0](tensor_list)
        out = []
        pos = []
        for name, x in xs.items():
            out.append(x)
            # position encoding
            pos.append(self[1](x).to(x.tensors.dtype))

        return out, pos

What is the meaning of self[0] and self[1] here?

Many thanks.
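Not from the repo, but a tiny illustration of the underlying mechanism: Joiner subclasses nn.Sequential, and nn.Sequential supports integer indexing into its submodules, so self[0] refers to the first module passed to __init__ (the backbone) and self[1] to the second (the position embedding).

import torch.nn as nn

seq = nn.Sequential(nn.Linear(4, 8), nn.ReLU())
print(seq[0])   # Linear(in_features=4, out_features=8, bias=True), the first submodule
print(seq[1])   # ReLU(), the second submodule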

What is the class id for background class?

I have a quick question: is the background class id 0 or 91? (DETR used 91 COCO categories to train)
It seems the targets object returned by the dataloader uses 1-91 for all the object categories, but the loss_labels function uses 91 instead of 0 for background. I am not sure if I missed something.

Thanks.

postprocess

Hi, I didn't see NMS in the postprocessing. Why don't you use NMS, and could you please explain how the postprocessing works?
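Not the maintainers' answer, but a rough sketch of what DETR-style post-processing amounts to (not the repo's PostProcess module verbatim): since the set-based bipartite matching loss already discourages duplicate predictions, there is no NMS; the raw outputs are simply turned into labeled, rescaled boxes. The 0.7 threshold below is an illustrative choice.

import torch

def postprocess(pred_logits, pred_boxes, img_w, img_h):
    """pred_logits: (num_queries, num_classes + 1); pred_boxes: (num_queries, 4) normalized cxcywh."""
    probas = pred_logits.softmax(-1)[:, :-1]   # drop the trailing "no object" class
    scores, labels = probas.max(-1)

    # normalized (cx, cy, w, h) -> absolute (x0, y0, x1, y1)
    cx, cy, w, h = pred_boxes.unbind(-1)
    boxes = torch.stack([cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h], dim=-1)
    boxes = boxes * torch.tensor([img_w, img_h, img_w, img_h], dtype=boxes.dtype)

    keep = scores > 0.7                        # simple confidence threshold, no NMS
    return scores[keep], labels[keep], boxes[keep]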

Question about residual connection

Hi, thank you so much for your work!

I have one question about the self-attention implementation. In the paper Attention is All You Need, the residual connection is made on top of the input embeddings + positional encoding, as shown in the figure below.
[figure]

In the DETR paper, the figure seems to match the above.
[figure]

However, in the code, it looks to me like the residual connection is made on top of the input embeddings only (the src); see the figure below. Is this a mistake, or is there a reason for such a modification? Thank you!
[figure]

Is the model reported in the paper trained for 300 epochs or 500 epochs?

In your README, it seems that the final model is trained for 300 epochs with a learning rate drop at 200 epochs.

However, in the following link, it seems like the 42.0 AP model is trained for 500 epochs with a learning rate drop at 400 epochs.

Can you clarify?

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py \
    --lr_drop 400 --epochs 500 \
    --coco_path /path/to/coco

https://gist.github.com/szagoruyko/9c9ebb8455610958f7deaa27845d7918

Any plan about data augmentation

❓ Is there any plan regarding data augmentation?

Hi, DETR teams,

According to the implementation in ultralytics/yolov3#310 (comment), and a similar approach discussed in AlexeyAB/darknet#3114 (comment), it seems that augmentations such as the mosaic technique are helpful for detecting smaller objects. I quote Jocher's conclusions below.

The smaller cars are detected earlier with less blinking and cars of all sizes show better behaved bounding boxes.

I checked make_coco_transforms of this repo and visualized the augmented images and labels on the VOC dataset (using the same config as make_coco_transforms here). Because of the use of RandomSizeCrop, all the labels associated with an image may be cropped away. (So does this repo support training with no targets in an image? 🤔️)

I want to know whether there is any plan regarding data augmentation.

Thank you!

How was the DETR demo trained?

tl;dr:

  1. how was the demo model trained?
  2. why does the batch size have to be 1?

Thanks for the amazing work!

I'm very intrigued by the simplicity of DETR, especially the inference demo code. I was wondering how the demo model was trained, since you guys do provide pretrained weights for it. I'm asking this particularly because the inference code says that it only supports a batch size of 1. Does the batch size have to be 1 during training? Also, why does it have to be 1, either in training or inference?

Thank you so much for your time!

will it make any sense to use zero v in the first decoder layer?

In your code, the tgt of the decoder layer is first assigned zeros, and these zeros are used as v to calculate a new output with the QKV attention operation. Take the pre-norm forward part as an example:

    def forward_pre(self, tgt, memory,
                    tgt_mask: Optional[Tensor] = None,
                    memory_mask: Optional[Tensor] = None,
                    tgt_key_padding_mask: Optional[Tensor] = None,
                    memory_key_padding_mask: Optional[Tensor] = None,
                    pos: Optional[Tensor] = None,
                    query_pos: Optional[Tensor] = None):
        tgt2 = self.norm1(tgt)
        q = k = self.with_pos_embed(tgt2, query_pos)
        tgt2 = self.self_attn(q, k, value=tgt2, attn_mask=tgt_mask,
                              key_padding_mask=tgt_key_padding_mask)[0]
        tgt = tgt + self.dropout1(tgt2)
        tgt2 = self.norm2(tgt)
        tgt2 = self.multihead_attn(query=self.with_pos_embed(tgt2, query_pos),
                                   key=self.with_pos_embed(memory, pos),
                                   value=memory, attn_mask=memory_mask,
                                   key_padding_mask=memory_key_padding_mask)[0]
        tgt = tgt + self.dropout2(tgt2)
        tgt2 = self.norm3(tgt)
        tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt2))))
        tgt = tgt + self.dropout3(tgt2)
        return tgt

I mean, if it is the first decoder layer and tgt is zero for every token, then tgt2 will be identical across tokens after the first layernorm. How does it make sense to get a weighted output from this tgt2? No matter what q and k are, nothing but a featureless bias will be learned, I think.

nms parameters vs mAP

First of all, thanks for presenting a great paper. It's one of the most innovative papers I've read recently in computer vision, and I'm sure many works will follow.

I was interested in the mAP performance with NMS in Fig. 4.
Does stronger NMS (like a threshold of 0.5) give similar mAP performance curves?
Maybe the mAP gets worse since more positive predictions will be deleted.

Problem with aspect ratios in colab demo

Hi!

  1. The colab demo doesn't seem to be working with images with wide aspect ratio (e.g. 16:9). The resulting bounding boxes are shifted to the right a bit and sometimes the inference crashes with a RuntimeError. Please see this colab notebook.

  2. The bounding boxes look good after I change T.Resize(800) to something explicit, like T.Resize((800,600)). But I'm not sure if that's the correct way of addressing that (see the giraffe detections in my notebook). What would be a correct way of dealing with different aspect ratios?

The only things I changed are the urls of the input images and the transformation pipeline (in the second case) 🙂

Implementation not consistent with the original paper?

Hi.
In the original paper, it is mentioned in Sec. 4 that

To optimize for AP, we override the prediction of these slots with the second highest scoring class, using the corresponding confidence. This improves AP by 2 points compared to filtering out empty slots.

But I didn't see any corresponding code in this repo. Did I miss something, or is it not implemented here?

Thank you.
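Not a confirmation of whether or where this lives in the repo, but a sketch of what the described trick would look like: for queries whose top class is "no object", take the second-highest scoring class and its confidence instead of dropping the slot.

import torch

def override_no_object(pred_logits):
    """pred_logits: (num_queries, num_classes + 1), last index = 'no object'.
    Returns per-query (label, score), replacing 'no object' picks with the runner-up class."""
    probas = pred_logits.softmax(-1)
    top2_scores, top2_labels = probas.topk(2, dim=-1)
    no_object = pred_logits.shape[-1] - 1

    labels = top2_labels[:, 0].clone()
    scores = top2_scores[:, 0].clone()
    is_empty = labels == no_object
    labels[is_empty] = top2_labels[is_empty, 1]   # second highest class
    scores[is_empty] = top2_scores[is_empty, 1]   # with its corresponding confidence
    return labels, scores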

Demo can be exported to ONNX but other pretrained models cannot

Instructions To Reproduce the Issue:

Run torch.onnx.export on the demo model provided here and on a model from torch hub. The demo model is successfully exported, while other models fail.

#works
torch.onnx.export(detr_demo, sample_input, 'detr_demo.onnx', opset_version = 10)

#does not work
detr = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True)
detr.eval()
torch.onnx.export(detr, sample_input, 'detr.onnx', opset_version = 10)

see full code here

The error log is as follows:

/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:59: TracerWarning: Converting a tensor to a Python index might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:60: TracerWarning: Converting a tensor to a Python index might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
/usr/local/lib/python3.6/dist-packages/torch/tensor.py:467: RuntimeWarning: Iterating over a tensor might cause the trace to be incorrect. Passing a tensor of different shape won't change the number of iterations executed (and might lead to errors or silently give incorrect results).
  'incorrect results).', category=RuntimeWarning)
/root/.cache/torch/hub/facebookresearch_detr_master/util/misc.py:294: TracerWarning: Converting a tensor to a Python index might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  batch_shape = (len(tensor_list),) + max_size
/root/.cache/torch/hub/facebookresearch_detr_master/util/misc.py:301: TracerWarning: Converting a tensor to a Python index might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  pad_img[: img.shape[0], : img.shape[1], : img.shape[2]].copy_(img)
/root/.cache/torch/hub/facebookresearch_detr_master/util/misc.py:302: TracerWarning: Converting a tensor to a Python index might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  m[: img.shape[1], :img.shape[2]] = False

---------------------------------------------------------------------------

IndexError                                Traceback (most recent call last)

<ipython-input-19-968e97398387> in <module>()
     11 
     12 torch.onnx.export(detr_demo, sample_input, 'detr_demo.onnx', opset_version = 10)
---> 13 torch.onnx.export(detr, sample_input, 'detr.onnx', opset_version = 10)

8 frames

/usr/local/lib/python3.6/dist-packages/torch/onnx/__init__.py in export(model, args, f, export_params, verbose, training, input_names, output_names, aten, export_raw_ir, operator_export_type, opset_version, _retain_param_name, do_constant_folding, example_outputs, strip_doc_string, dynamic_axes, keep_initializers_as_inputs, custom_opsets, enable_onnx_checker, use_external_data_format)
    166                         do_constant_folding, example_outputs,
    167                         strip_doc_string, dynamic_axes, keep_initializers_as_inputs,
--> 168                         custom_opsets, enable_onnx_checker, use_external_data_format)
    169 
    170 

/usr/local/lib/python3.6/dist-packages/torch/onnx/utils.py in export(model, args, f, export_params, verbose, training, input_names, output_names, aten, export_raw_ir, operator_export_type, opset_version, _retain_param_name, do_constant_folding, example_outputs, strip_doc_string, dynamic_axes, keep_initializers_as_inputs, custom_opsets, enable_onnx_checker, use_external_data_format)
     67             dynamic_axes=dynamic_axes, keep_initializers_as_inputs=keep_initializers_as_inputs,
     68             custom_opsets=custom_opsets, enable_onnx_checker=enable_onnx_checker,
---> 69             use_external_data_format=use_external_data_format)
     70 
     71 

/usr/local/lib/python3.6/dist-packages/torch/onnx/utils.py in _export(model, args, f, export_params, verbose, training, input_names, output_names, operator_export_type, export_type, example_outputs, propagate, opset_version, _retain_param_name, do_constant_folding, strip_doc_string, dynamic_axes, keep_initializers_as_inputs, fixed_batch_size, custom_opsets, add_node_names, enable_onnx_checker, use_external_data_format)
    486                                                         example_outputs, propagate,
    487                                                         _retain_param_name, val_do_constant_folding,
--> 488                                                         fixed_batch_size=fixed_batch_size)
    489 
    490         # TODO: Don't allocate a in-memory string for the protobuf

/usr/local/lib/python3.6/dist-packages/torch/onnx/utils.py in _model_to_graph(model, args, verbose, training, input_names, output_names, operator_export_type, example_outputs, propagate, _retain_param_name, do_constant_folding, _disable_torch_constant_prop, fixed_batch_size)
    349     graph = _optimize_graph(graph, operator_export_type,
    350                             _disable_torch_constant_prop=_disable_torch_constant_prop,
--> 351                             fixed_batch_size=fixed_batch_size, params_dict=params_dict)
    352 
    353     if isinstance(model, torch.jit.ScriptModule) or isinstance(model, torch.jit.ScriptFunction):

/usr/local/lib/python3.6/dist-packages/torch/onnx/utils.py in _optimize_graph(graph, operator_export_type, _disable_torch_constant_prop, fixed_batch_size, params_dict)
    152         torch._C._jit_pass_erase_number_types(graph)
    153 
--> 154         graph = torch._C._jit_pass_onnx(graph, operator_export_type)
    155         torch._C._jit_pass_lint(graph)
    156 

/usr/local/lib/python3.6/dist-packages/torch/onnx/__init__.py in _run_symbolic_function(*args, **kwargs)
    197 def _run_symbolic_function(*args, **kwargs):
    198     from torch.onnx import utils
--> 199     return utils._run_symbolic_function(*args, **kwargs)
    200 
    201 

/usr/local/lib/python3.6/dist-packages/torch/onnx/utils.py in _run_symbolic_function(g, n, inputs, env, operator_export_type)
    738                                   .format(op_name, opset_version, op_name))
    739                 op_fn = sym_registry.get_registered_op(op_name, '', opset_version)
--> 740                 return op_fn(g, *inputs, **attrs)
    741 
    742         elif ns == "prim":

/usr/local/lib/python3.6/dist-packages/torch/onnx/symbolic_helper.py in wrapper(g, *args)
    127             assert len(arg_descriptors) >= len(args)
    128             args = [_parse_arg(arg, arg_desc) for arg, arg_desc in zip(args, arg_descriptors)]
--> 129             return fn(g, *args)
    130         # In Python 2 functools.wraps chokes on partially applied functions, so we need this as a workaround
    131         try:

/usr/local/lib/python3.6/dist-packages/torch/onnx/symbolic_opset9.py in ones(g, sizes, dtype, layout, device, pin_memory)
   1409         dtype = 6  # float
   1410     return g.op("ConstantOfShape", sizes,
-> 1411                 value_t=torch.tensor([1], dtype=sym_help.scalar_type_to_pytorch_type[dtype]))
   1412 
   1413 

IndexError: list index out of range

Expected behavior:

It should be possible to export a model from torchhub similar to the demo model.

Environment:

Google colab

AttributeError: module 'torch.distributed' has no attribute 'init_process_group'

I'm trying to run the example as-is, and I'm running into this issue. I did have to adjust the number of GPUs because the VM I'm working on only has 1. I'm also working on a Windows 10 machine with PyTorch version 1.5.0, CUDA version 10.1, and CUDA compiler driver v10.0.130.

| distributed init (rank 0): env://
Traceback (most recent call last):
  File "main.py", line 248, in <module>
    main(args)
  File "main.py", line 106, in main
    utils.init_distributed_mode(args)
  File "C:\Users\-user-\Documents\Projects\detr\util\misc.py", line 374, in init_distributed_mode
    torch.distributed.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
AttributeError: module 'torch.distributed' has no attribute 'init_process_group'
Traceback (most recent call last):
  File "C:\Anaconda\envs\detr\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Anaconda\envs\detr\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Anaconda\envs\detr\lib\site-packages\torch\distributed\launch.py", line 263, in <module>
    main()
  File "C:\Anaconda\envs\detr\lib\site-packages\torch\distributed\launch.py", line 258, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['C:\\Anaconda\\envs\\detr\\python.exe', '-u', 'main.py', '--coco_path', 'F:/coco-data']' returned non-zero exit status 1.

Custom Object Detection Training

❓ How to do something using DETR

Describe what you want to do, including:

  1. what inputs you will provide, if any:
  2. what outputs you are expecting:

NOTE:

  1. Only general answers are provided.
    If you want to ask about "why X did not work", please use the
    Unexpected behaviors issue template.

  2. About how to implement new models / new dataloader / new training logic, etc., check documentation first.

  3. We do not answer general machine learning / computer vision questions that are not specific to DETR, such as how a model works, how to improve your training/make it converge, or what algorithm/methods can be used to achieve X.

How do I train a new model for custom object detection in Google Colab?

Question about training on my own dataset

Hi,
Thank you for your great work.
I have a question about training on my own dataset. The eval result is always zero, like this:

**(base) [detr]$ python -m torch.distributed.launch --nproc_per_node=1 --use_env main.py --lr 1e-3 --batch_size 4 --epochs 10  --coco_path datasets/shape/coco**
| distributed init (rank 0): env://
git:
  sha: 0af41930d1b6c2244e33bbef76dff6c537dd53c0, status: clean, branch: master

**Namespace(aux_loss=True, backbone='resnet50', batch_size=4, bbox_loss_coef=5, clip_max_norm=0.1, coco_panoptic_path=None, coco_path='datasets/shape/coco', dataset_file='coco', dec_layers=6, device='cuda', dice_loss_coef=1, dilation=False, dim_feedforward=2048, dist_backend='nccl', dist_url='env://', distributed=True, dropout=0.1, enc_layers=6, eos_coef=0.1, epochs=10, eval=False, frozen_weights=None, giou_loss_coef=2, gpu=0, hidden_dim=256, lr=0.001, lr_backbone=1e-05, lr_drop=200, mask_loss_coef=1, masks=False, nheads=8, num_queries=100, num_workers=2, output_dir='', position_embedding='sine', pre_norm=False, rank=0, remove_difficult=False, resume='', seed=42, set_cost_bbox=5, set_cost_class=1, set_cost_giou=2, start_epoch=0, weight_decay=0.0001, world_size=1)**
number of params: 41302368
loading annotations into memory...
Done (t=0.02s)
creating index...
index created!
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
Start training
Training log
Epoch: [0]  [  0/225]  eta: 0:02:30  lr: 0.001000  class_error: 100.00  loss: 75.9316 (75.9316)  loss_ce: 4.8402 (4.8402)  loss_bbox: 5.6168 (5.6168)  loss_giou: 2.2340 (2.2340)  loss_ce_0: 4.4001 (4.4001)  loss_bbox_0: 5.4950 (5.4950)  loss_giou_0: 2.2311 (2.2311)  loss_ce_1: 4.8179 (4.8179)  loss_bbox_1: 5.6163 (5.6163)  loss_giou_1: 2.2393 (2.2393)  loss_ce_2: 4.7843 (4.7843)  loss_bbox_2: 5.6247 (5.6247)  loss_giou_2: 2.2343 (2.2343)  loss_ce_3: 4.9645 (4.9645)  loss_bbox_3: 5.6222 (5.6222)  loss_giou_3: 2.2467 (2.2467)  loss_ce_4: 4.9737 (4.9737)  loss_bbox_4: 5.7800 (5.7800)  loss_giou_4: 2.2105 (2.2105)  loss_ce_unscaled: 4.8402 (4.8402)  class_error_unscaled: 100.0000 (100.0000)  loss_bbox_unscaled: 1.1234 (1.1234)  loss_giou_unscaled: 1.1170 (1.1170)  cardinality_error_unscaled: 96.7500 (96.7500)  loss_ce_0_unscaled: 4.4001 (4.4001)  loss_bbox_0_unscaled: 1.0990 (1.0990)  loss_giou_0_unscaled: 1.1155 (1.1155)  cardinality_error_0_unscaled: 96.7500 (96.7500)  loss_ce_1_unscaled: 4.8179 (4.8179)  loss_bbox_1_unscaled: 1.1233 (1.1233)  loss_giou_1_unscaled: 1.1197 (1.1197)  cardinality_error_1_unscaled: 96.7500 (96.7500)  loss_ce_2_unscaled: 4.7843 (4.7843)  loss_bbox_2_unscaled: 1.1249 (1.1249)  loss_giou_2_unscaled: 1.1172 (1.1172)  cardinality_error_2_unscaled: 96.7500 (96.7500)  loss_ce_3_unscaled: 4.9645 (4.9645)  loss_bbox_3_unscaled: 1.1244 (1.1244)  loss_giou_3_unscaled: 1.1234 (1.1234)  cardinality_error_3_unscaled: 96.7500 (96.7500)  loss_ce_4_unscaled: 4.9737 (4.9737)  loss_bbox_4_unscaled: 1.1560 (1.1560)  loss_giou_4_unscaled: 1.1053 (1.1053)  cardinality_error_4_unscaled: 96.7500 (96.7500)  time: 0.6674  data: 0.2966  max mem: 2899
Epoch: [0]  [ 10/225]  eta: 0:01:16  lr: 0.001000  class_error: 100.00  loss: 40.5536 (44.2869)  loss_ce: 0.8244 (1.3583)  loss_bbox: 3.0293 (3.3114)  loss_giou: 2.7750 (2.7253)  loss_ce_0: 0.8599 (1.2429)  loss_bbox_0: 3.0818 (3.3239)  loss_giou_0: 2.7982 (2.7456)  loss_ce_1: 0.8457 (1.3100)  loss_bbox_1: 3.1305 (3.3230)  loss_giou_1: 2.7961 (2.7431)  loss_ce_2: 0.8787 (1.3171)  loss_bbox_2: 3.0785 (3.3198)  loss_giou_2: 2.8003 (2.7389)  loss_ce_3: 0.8455 (1.3657)  loss_bbox_3: 3.0552 (3.3092)  loss_giou_3: 2.7829 (2.7311)  loss_ce_4: 0.8526 (1.3903)  loss_bbox_4: 3.0473 (3.3076)  loss_giou_4: 2.7943 (2.7239)  loss_ce_unscaled: 0.8244 (1.3583)  class_error_unscaled: 100.0000 (93.9394)  loss_bbox_unscaled: 0.6059 (0.6623)  loss_giou_unscaled: 1.3875 (1.3626)  cardinality_error_unscaled: 3.5000 (19.9773)  loss_ce_0_unscaled: 0.8599 (1.2429)  loss_bbox_0_unscaled: 0.6164 (0.6648)  loss_giou_0_unscaled: 1.3991 (1.3728)  cardinality_error_0_unscaled: 3.0000 (11.3636)  loss_ce_1_unscaled: 0.8457 (1.3100)  loss_bbox_1_unscaled: 0.6261 (0.6646)  loss_giou_1_unscaled: 1.3981 (1.3716)  cardinality_error_1_unscaled: 3.5000 (12.5909)  loss_ce_2_unscaled: 0.8787 (1.3171)  loss_bbox_2_unscaled: 0.6157 (0.6640)  loss_giou_2_unscaled: 1.4001 (1.3694)  cardinality_error_2_unscaled: 3.5000 (19.2500)  loss_ce_3_unscaled: 0.8455 (1.3657)  loss_bbox_3_unscaled: 0.6110 (0.6618)  loss_giou_3_unscaled: 1.3915 (1.3656)  cardinality_error_3_unscaled: 3.5000 (19.9773)  loss_ce_4_unscaled: 0.8526 (1.3903)  loss_bbox_4_unscaled: 0.6095 (0.6615)  loss_giou_4_unscaled: 1.3971 (1.3619)  cardinality_error_4_unscaled: 3.5000 (19.9773)  time: 0.3566  data: 0.0417  max mem: 4100
Epoch: [0]  [ 20/225]  eta: 0:01:11  lr: 0.001000  class_error: 100.00  loss: 38.5593 (40.8082)  loss_ce: 0.7449 (1.0389)  loss_bbox: 2.5801 (2.9656)  loss_giou: 2.8444 (2.8061)  loss_ce_0: 0.7395 (0.9770)  loss_bbox_0: 2.5890 (2.9779)  loss_giou_0: 2.8093 (2.8087)  loss_ce_1: 0.7514 (1.0102)  loss_bbox_1: 2.5790 (2.9720)  loss_giou_1: 2.8446 (2.8075)  loss_ce_2: 0.7420 (1.0158)  loss_bbox_2: 2.5879 (2.9603)  loss_giou_2: 2.8726 (2.8131)  loss_ce_3: 0.7451 (1.0461)  loss_bbox_3: 2.5694 (2.9725)  loss_giou_3: 2.8119 (2.8037)  loss_ce_4: 0.7562 (1.0560)  loss_bbox_4: 2.5707 (2.9682)  loss_giou_4: 2.8531 (2.8084)  loss_ce_unscaled: 0.7449 (1.0389)  class_error_unscaled: 100.0000 (96.8254)  loss_bbox_unscaled: 0.5160 (0.5931)  loss_giou_unscaled: 1.4222 (1.4031)  cardinality_error_unscaled: 2.7500 (11.7857)  loss_ce_0_unscaled: 0.7395 (0.9770)  loss_bbox_0_unscaled: 0.5178 (0.5956)  loss_giou_0_unscaled: 1.4046 (1.4043)  cardinality_error_0_unscaled: 2.7500 (7.2738)  loss_ce_1_unscaled: 0.7514 (1.0102)  loss_bbox_1_unscaled: 0.5158 (0.5944)  loss_giou_1_unscaled: 1.4223 (1.4037)  cardinality_error_1_unscaled: 2.7500 (7.9167)  loss_ce_2_unscaled: 0.7420 (1.0158)  loss_bbox_2_unscaled: 0.5176 (0.5921)  loss_giou_2_unscaled: 1.4363 (1.4066)  cardinality_error_2_unscaled: 2.7500 (11.4048)  loss_ce_3_unscaled: 0.7451 (1.0461)  loss_bbox_3_unscaled: 0.5139 (0.5945)  loss_giou_3_unscaled: 1.4059 (1.4019)  cardinality_error_3_unscaled: 2.7500 (11.7857)  loss_ce_4_unscaled: 0.7562 (1.0560)  loss_bbox_4_unscaled: 0.5141 (0.5936)  loss_giou_4_unscaled: 1.4265 (1.4042)  cardinality_error_4_unscaled: 2.7500 (11.7857)  time: 0.3333  data: 0.0155  max mem: 4763
Epoch: [0]  [ 30/225]  eta: 0:01:06  lr: 0.001000  class_error: 100.00  loss: 36.1775 (39.1106)  loss_ce: 0.6629 (0.9151)  loss_bbox: 2.5098 (2.8222)  loss_giou: 2.8313 (2.7752)  loss_ce_0: 0.6502 (0.8715)  loss_bbox_0: 2.5458 (2.8372)  loss_giou_0: 2.8088 (2.7885)  loss_ce_1: 0.6497 (0.8934)  loss_bbox_1: 2.5484 (2.8363)  loss_giou_1: 2.7917 (2.7771)  loss_ce_2: 0.6553 (0.9005)  loss_bbox_2: 2.4740 (2.8460)  loss_giou_2: 2.8586 (2.7895)  loss_ce_3: 0.6577 (0.9152)  loss_bbox_3: 2.5694 (2.8280)  loss_giou_3: 2.8119 (2.7908)  loss_ce_4: 0.6433 (0.9227)  loss_bbox_4: 2.5183 (2.8121)  loss_giou_4: 2.8352 (2.7893)  loss_ce_unscaled: 0.6629 (0.9151)  class_error_unscaled: 100.0000 (97.8495)  loss_bbox_unscaled: 0.5020 (0.5644)  loss_giou_unscaled: 1.4156 (1.3876)  cardinality_error_unscaled: 2.7500 (8.8468)  loss_ce_0_unscaled: 0.6502 (0.8715)  loss_bbox_0_unscaled: 0.5092 (0.5674)  loss_giou_0_unscaled: 1.4044 (1.3943)  cardinality_error_0_unscaled: 2.7500 (5.7903)  loss_ce_1_unscaled: 0.6497 (0.8934)  loss_bbox_1_unscaled: 0.5097 (0.5673)  loss_giou_1_unscaled: 1.3958 (1.3885)  cardinality_error_1_unscaled: 2.7500 (6.2258)  loss_ce_2_unscaled: 0.6553 (0.9005)  loss_bbox_2_unscaled: 0.4948 (0.5692)  loss_giou_2_unscaled: 1.4293 (1.3948)  cardinality_error_2_unscaled: 2.7500 (8.5887)  loss_ce_3_unscaled: 0.6577 (0.9152)  loss_bbox_3_unscaled: 0.5139 (0.5656)  loss_giou_3_unscaled: 1.4059 (1.3954)  cardinality_error_3_unscaled: 2.7500 (8.8468)  loss_ce_4_unscaled: 0.6433 (0.9227)  loss_bbox_4_unscaled: 0.5037 (0.5624)  loss_giou_4_unscaled: 1.4176 (1.3947)  cardinality_error_4_unscaled: 2.7500 (8.8468)  time: 0.3356  data: 0.0145  max mem: 5477
Epoch: [0]  [ 40/225]  eta: 0:01:03  lr: 0.001000  class_error: 100.00  loss: 32.6019 (36.9908)  loss_ce: 0.6956 (0.8742)  loss_bbox: 2.1999 (2.5978)  loss_giou: 2.5164 (2.6636)  loss_ce_0: 0.6839 (0.8402)  loss_bbox_0: 2.2153 (2.6161)  loss_giou_0: 2.4194 (2.6766)  loss_ce_1: 0.7205 (0.8589)  loss_bbox_1: 2.1482 (2.6176)  loss_giou_1: 2.4094 (2.6681)  loss_ce_2: 0.6871 (0.8615)  loss_bbox_2: 2.3961 (2.6406)  loss_giou_2: 2.5275 (2.6903)  loss_ce_3: 0.6944 (0.8725)  loss_bbox_3: 2.1964 (2.6045)  loss_giou_3: 2.5396 (2.6689)  loss_ce_4: 0.6961 (0.8798)  loss_bbox_4: 2.2572 (2.6533)  loss_giou_4: 2.6218 (2.7062)  loss_ce_unscaled: 0.6956 (0.8742)  class_error_unscaled: 100.0000 (98.3740)  loss_bbox_unscaled: 0.4400 (0.5196)  loss_giou_unscaled: 1.2582 (1.3318)  cardinality_error_unscaled: 2.7500 (7.5122)  loss_ce_0_unscaled: 0.6839 (0.8402)  loss_bbox_0_unscaled: 0.4431 (0.5232)  loss_giou_0_unscaled: 1.2097 (1.3383)  cardinality_error_0_unscaled: 2.7500 (5.2012)  loss_ce_1_unscaled: 0.7205 (0.8589)  loss_bbox_1_unscaled: 0.4296 (0.5235)  loss_giou_1_unscaled: 1.2047 (1.3341)  cardinality_error_1_unscaled: 2.7500 (5.5305)  loss_ce_2_unscaled: 0.6871 (0.8615)  loss_bbox_2_unscaled: 0.4792 (0.5281)  loss_giou_2_unscaled: 1.2638 (1.3451)  cardinality_error_2_unscaled: 2.7500 (7.3171)  loss_ce_3_unscaled: 0.6944 (0.8725)  loss_bbox_3_unscaled: 0.4393 (0.5209)  loss_giou_3_unscaled: 1.2698 (1.3345)  cardinality_error_3_unscaled: 2.7500 (7.5122)  loss_ce_4_unscaled: 0.6961 (0.8798)  loss_bbox_4_unscaled: 0.4514 (0.5307)  loss_giou_4_unscaled: 1.3109 (1.3531)  cardinality_error_4_unscaled: 2.7500 (7.5122)  time: 0.3336  data: 0.0145  max mem: 5477
Epoch: [0]  [ 50/225]  eta: 0:00:59  lr: 0.001000  class_error: 100.00  loss: 27.3266 (34.7739)  loss_ce: 0.7699 (0.8480)  loss_bbox: 1.7696 (2.3824)  loss_giou: 2.1386 (2.5234)  loss_ce_0: 0.7753 (0.8237)  loss_bbox_0: 1.7433 (2.4192)  loss_giou_0: 2.1408 (2.5614)  loss_ce_1: 0.7667 (0.8363)  loss_bbox_1: 1.7529 (2.4163)  loss_giou_1: 2.1323 (2.5346)  loss_ce_2: 0.7698 (0.8400)  loss_bbox_2: 1.7657 (2.4392)  loss_giou_2: 2.2232 (2.5671)  loss_ce_3: 0.7478 (0.8485)  loss_bbox_3: 1.6155 (2.3823)  loss_giou_3: 2.0623 (2.5389)  loss_ce_4: 0.7658 (0.8536)  loss_bbox_4: 1.6977 (2.4118)  loss_giou_4: 2.0993 (2.5472)  loss_ce_unscaled: 0.7699 (0.8480)  class_error_unscaled: 100.0000 (98.6928)  loss_bbox_unscaled: 0.3539 (0.4765)  loss_giou_unscaled: 1.0693 (1.2617)  cardinality_error_unscaled: 3.5000 (6.6765)  loss_ce_0_unscaled: 0.7753 (0.8237)  loss_bbox_0_unscaled: 0.3487 (0.4838)  loss_giou_0_unscaled: 1.0704 (1.2807)  cardinality_error_0_unscaled: 3.5000 (4.8186)  loss_ce_1_unscaled: 0.7667 (0.8363)  loss_bbox_1_unscaled: 0.3506 (0.4833)  loss_giou_1_unscaled: 1.0662 (1.2673)  cardinality_error_1_unscaled: 3.5000 (5.0833)  loss_ce_2_unscaled: 0.7698 (0.8400)  loss_bbox_2_unscaled: 0.3531 (0.4878)  loss_giou_2_unscaled: 1.1116 (1.2836)  cardinality_error_2_unscaled: 3.5000 (6.5196)  loss_ce_3_unscaled: 0.7478 (0.8485)  loss_bbox_3_unscaled: 0.3231 (0.4765)  loss_giou_3_unscaled: 1.0311 (1.2694)  cardinality_error_3_unscaled: 3.5000 (6.6765)  loss_ce_4_unscaled: 0.7658 (0.8536)  loss_bbox_4_unscaled: 0.3395 (0.4824)  loss_giou_4_unscaled: 1.0496 (1.2736)  cardinality_error_4_unscaled: 3.5000 (6.6765)  time: 0.3331  data: 0.0146  max mem: 5477
Epoch: [0]  [ 60/225]  eta: 0:00:55  lr: 0.001000  class_error: 100.00  loss: 23.1808 (32.6448)  loss_ce: 0.6915 (0.8191)  loss_bbox: 1.2613 (2.2106)  loss_giou: 1.8641 (2.3880)  loss_ce_0: 0.7082 (0.7989)  loss_bbox_0: 1.3390 (2.2320)  loss_giou_0: 1.7792 (2.4135)  loss_ce_1: 0.7016 (0.8092)  loss_bbox_1: 1.2421 (2.2059)  loss_giou_1: 1.7174 (2.3772)  loss_ce_2: 0.6996 (0.8123)  loss_bbox_2: 1.5305 (2.2715)  loss_giou_2: 1.8414 (2.4372)  loss_ce_3: 0.7185 (0.8226)  loss_bbox_3: 1.3289 (2.1920)  loss_giou_3: 1.7667 (2.3846)  loss_ce_4: 0.6862 (0.8221)  loss_bbox_4: 1.3238 (2.2314)  loss_giou_4: 1.8016 (2.4168)  loss_ce_unscaled: 0.6915 (0.8191)  class_error_unscaled: 100.0000 (98.9071)  loss_bbox_unscaled: 0.2523 (0.4421)  loss_giou_unscaled: 0.9320 (1.1940)  cardinality_error_unscaled: 3.0000 (6.0123)  loss_ce_0_unscaled: 0.7082 (0.7989)  loss_bbox_0_unscaled: 0.2678 (0.4464)  loss_giou_0_unscaled: 0.8896 (1.2068)  cardinality_error_0_unscaled: 3.0000 (4.4590)  loss_ce_1_unscaled: 0.7016 (0.8092)  loss_bbox_1_unscaled: 0.2484 (0.4412)  loss_giou_1_unscaled: 0.8587 (1.1886)  cardinality_error_1_unscaled: 3.0000 (4.6803)  loss_ce_2_unscaled: 0.6996 (0.8123)  loss_bbox_2_unscaled: 0.3061 (0.4543)  loss_giou_2_unscaled: 0.9207 (1.2186)  cardinality_error_2_unscaled: 3.0000 (5.8811)  loss_ce_3_unscaled: 0.7185 (0.8226)  loss_bbox_3_unscaled: 0.2658 (0.4384)  loss_giou_3_unscaled: 0.8833 (1.1923)  cardinality_error_3_unscaled: 3.0000 (6.0123)  loss_ce_4_unscaled: 0.6862 (0.8221)  loss_bbox_4_unscaled: 0.2648 (0.4463)  loss_giou_4_unscaled: 0.9008 (1.2084)  cardinality_error_4_unscaled: 3.0000 (6.0123)  time: 0.3338  data: 0.0146  max mem: 5477
Epoch: [0]  [ 70/225]  eta: 0:00:52  lr: 0.001000  class_error: 100.00  loss: 21.3303 (30.9936)  loss_ce: 0.6637 (0.8001)  loss_bbox: 1.1000 (2.0557)  loss_giou: 1.5853 (2.2778)  loss_ce_0: 0.6661 (0.7836)  loss_bbox_0: 1.1435 (2.0696)  loss_giou_0: 1.6672 (2.3055)  loss_ce_1: 0.6682 (0.7913)  loss_bbox_1: 1.0864 (2.0510)  loss_giou_1: 1.5471 (2.2684)  loss_ce_2: 0.6681 (0.7951)  loss_bbox_2: 1.1686 (2.0983)  loss_giou_2: 1.5572 (2.3080)  loss_ce_3: 0.6903 (0.8043)  loss_bbox_3: 1.1644 (2.0551)  loss_giou_3: 1.6332 (2.2966)  loss_ce_4: 0.6541 (0.8009)  loss_bbox_4: 1.2588 (2.1030)  loss_giou_4: 1.7398 (2.3293)  loss_ce_unscaled: 0.6637 (0.8001)  class_error_unscaled: 100.0000 (99.0610)  loss_bbox_unscaled: 0.2200 (0.4111)  loss_giou_unscaled: 0.7926 (1.1389)  cardinality_error_unscaled: 2.2500 (5.5563)  loss_ce_0_unscaled: 0.6661 (0.7836)  loss_bbox_0_unscaled: 0.2287 (0.4139)  loss_giou_0_unscaled: 0.8336 (1.1528)  cardinality_error_0_unscaled: 2.2500 (4.2218)  loss_ce_1_unscaled: 0.6682 (0.7913)  loss_bbox_1_unscaled: 0.2173 (0.4102)  loss_giou_1_unscaled: 0.7736 (1.1342)  cardinality_error_1_unscaled: 2.2500 (4.4120)  loss_ce_2_unscaled: 0.6681 (0.7951)  loss_bbox_2_unscaled: 0.2337 (0.4197)  loss_giou_2_unscaled: 0.7786 (1.1540)  cardinality_error_2_unscaled: 2.2500 (5.4401)  loss_ce_3_unscaled: 0.6903 (0.8043)  loss_bbox_3_unscaled: 0.2329 (0.4110)  loss_giou_3_unscaled: 0.8166 (1.1483)  cardinality_error_3_unscaled: 2.2500 (5.5563)  loss_ce_4_unscaled: 0.6541 (0.8009)  loss_bbox_4_unscaled: 0.2518 (0.4206)  loss_giou_4_unscaled: 0.8699 (1.1647)  cardinality_error_4_unscaled: 2.2500 (5.5563)  time: 0.3378  data: 0.0146  max mem: 5477
Epoch: [0]  [ 80/225]  eta: 0:00:48  lr: 0.001000  class_error: 100.00  loss: 20.5233 (29.6911)  loss_ce: 0.6906 (0.7851)  loss_bbox: 1.0481 (1.9269)  loss_giou: 1.5147 (2.1781)  loss_ce_0: 0.7025 (0.7712)  loss_bbox_0: 1.0714 (1.9364)  loss_giou_0: 1.4821 (2.1900)  loss_ce_1: 0.6821 (0.7767)  loss_bbox_1: 1.1409 (1.9650)  loss_giou_1: 1.6625 (2.2206)  loss_ce_2: 0.6765 (0.7826)  loss_bbox_2: 1.0265 (1.9785)  loss_giou_2: 1.4500 (2.2144)  loss_ce_3: 0.6955 (0.7901)  loss_bbox_3: 1.0413 (1.9257)  loss_giou_3: 1.5327 (2.1867)  loss_ce_4: 0.6909 (0.7869)  loss_bbox_4: 1.2588 (2.0151)  loss_giou_4: 1.7385 (2.2610)  loss_ce_unscaled: 0.6906 (0.7851)  class_error_unscaled: 100.0000 (99.1770)  loss_bbox_unscaled: 0.2096 (0.3854)  loss_giou_unscaled: 0.7574 (1.0890)  cardinality_error_unscaled: 3.0000 (5.2191)  loss_ce_0_unscaled: 0.7025 (0.7712)  loss_bbox_0_unscaled: 0.2143 (0.3873)  loss_giou_0_unscaled: 0.7411 (1.0950)  cardinality_error_0_unscaled: 3.0000 (4.0494)  loss_ce_1_unscaled: 0.6821 (0.7767)  loss_bbox_1_unscaled: 0.2282 (0.3930)  loss_giou_1_unscaled: 0.8313 (1.1103)  cardinality_error_1_unscaled: 3.0000 (4.2160)  loss_ce_2_unscaled: 0.6765 (0.7826)  loss_bbox_2_unscaled: 0.2053 (0.3957)  loss_giou_2_unscaled: 0.7250 (1.1072)  cardinality_error_2_unscaled: 3.0000 (5.1173)  loss_ce_3_unscaled: 0.6955 (0.7901)  loss_bbox_3_unscaled: 0.2083 (0.3851)  loss_giou_3_unscaled: 0.7663 (1.0933)  cardinality_error_3_unscaled: 3.0000 (5.2191)  loss_ce_4_unscaled: 0.6909 (0.7869)  loss_bbox_4_unscaled: 0.2518 (0.4030)  loss_giou_4_unscaled: 0.8693 (1.1305)  cardinality_error_4_unscaled: 3.0000 (5.2191)  time: 0.3316  data: 0.0145  max mem: 5477
Epoch: [0]  [ 90/225]  eta: 0:00:45  lr: 0.001000  class_error: 100.00  loss: 20.0966 (28.7047)  loss_ce: 0.7220 (0.7781)  loss_bbox: 0.9469 (1.8428)  loss_giou: 1.4894 (2.1132)  loss_ce_0: 0.7117 (0.7650)  loss_bbox_0: 1.0125 (1.8465)  loss_giou_0: 1.4151 (2.1172)  loss_ce_1: 0.7015 (0.7705)  loss_bbox_1: 1.2264 (1.8691)  loss_giou_1: 1.6625 (2.1449)  loss_ce_2: 0.7058 (0.7745)  loss_bbox_2: 1.1000 (1.8894)  loss_giou_2: 1.5367 (2.1538)  loss_ce_3: 0.7138 (0.7822)  loss_bbox_3: 0.9728 (1.8292)  loss_giou_3: 1.3945 (2.1083)  loss_ce_4: 0.7190 (0.7797)  loss_bbox_4: 1.2304 (1.9368)  loss_giou_4: 1.6800 (2.2033)  loss_ce_unscaled: 0.7220 (0.7781)  class_error_unscaled: 100.0000 (99.2674)  loss_bbox_unscaled: 0.1894 (0.3686)  loss_giou_unscaled: 0.7447 (1.0566)  cardinality_error_unscaled: 3.0000 (4.9945)  loss_ce_0_unscaled: 0.7117 (0.7650)  loss_bbox_0_unscaled: 0.2025 (0.3693)  loss_giou_0_unscaled: 0.7075 (1.0586)  cardinality_error_0_unscaled: 3.0000 (3.9533)  loss_ce_1_unscaled: 0.7015 (0.7705)  loss_bbox_1_unscaled: 0.2453 (0.3738)  loss_giou_1_unscaled: 0.8313 (1.0725)  cardinality_error_1_unscaled: 3.0000 (4.1016)  loss_ce_2_unscaled: 0.7058 (0.7745)  loss_bbox_2_unscaled: 0.2200 (0.3779)  loss_giou_2_unscaled: 0.7684 (1.0769)  cardinality_error_2_unscaled: 3.0000 (4.9038)  loss_ce_3_unscaled: 0.7138 (0.7822)  loss_bbox_3_unscaled: 0.1946 (0.3658)  loss_giou_3_unscaled: 0.6972 (1.0541)  cardinality_error_3_unscaled: 3.0000 (4.9945)  loss_ce_4_unscaled: 0.7190 (0.7797)  loss_bbox_4_unscaled: 0.2461 (0.3874)  loss_giou_4_unscaled: 0.8400 (1.1017)  cardinality_error_4_unscaled: 3.0000 (4.9918)  time: 0.3271  data: 0.0144  max mem: 5477
Epoch: [0]  [100/225]  eta: 0:00:42  lr: 0.001000  class_error: 100.00  loss: 20.1649 (27.9541)  loss_ce: 0.7003 (0.7687)  loss_bbox: 1.0745 (1.7630)  loss_giou: 1.5262 (2.0532)  loss_ce_0: 0.7128 (0.7564)  loss_bbox_0: 1.1028 (1.7936)  loss_giou_0: 1.5624 (2.0816)  loss_ce_1: 0.7015 (0.7603)  loss_bbox_1: 1.1178 (1.7996)  loss_giou_1: 1.4671 (2.0949)  loss_ce_2: 0.6817 (0.7625)  loss_bbox_2: 1.2027 (1.8460)  loss_giou_2: 1.6791 (2.1249)  loss_ce_3: 0.7115 (0.7715)  loss_bbox_3: 1.0009 (1.7556)  loss_giou_3: 1.4495 (2.0527)  loss_ce_4: 0.7190 (0.7703)  loss_bbox_4: 1.1550 (1.8591)  loss_giou_4: 1.5835 (2.1402)  loss_ce_unscaled: 0.7003 (0.7687)  class_error_unscaled: 100.0000 (99.3399)  loss_bbox_unscaled: 0.2149 (0.3526)  loss_giou_unscaled: 0.7631 (1.0266)  cardinality_error_unscaled: 3.0000 (4.7797)  loss_ce_0_unscaled: 0.7128 (0.7564)  loss_bbox_0_unscaled: 0.2206 (0.3587)  loss_giou_0_unscaled: 0.7812 (1.0408)  cardinality_error_0_unscaled: 3.0000 (3.8416)  loss_ce_1_unscaled: 0.7015 (0.7603)  loss_bbox_1_unscaled: 0.2236 (0.3599)  loss_giou_1_unscaled: 0.7335 (1.0474)  cardinality_error_1_unscaled: 3.0000 (3.9752)  loss_ce_2_unscaled: 0.6817 (0.7625)  loss_bbox_2_unscaled: 0.2405 (0.3692)  loss_giou_2_unscaled: 0.8396 (1.0625)  cardinality_error_2_unscaled: 3.0000 (4.6980)  loss_ce_3_unscaled: 0.7115 (0.7715)  loss_bbox_3_unscaled: 0.2002 (0.3511)  loss_giou_3_unscaled: 0.7248 (1.0264)  cardinality_error_3_unscaled: 3.0000 (4.7797)  loss_ce_4_unscaled: 0.7190 (0.7703)  loss_bbox_4_unscaled: 0.2310 (0.3718)  loss_giou_4_unscaled: 0.7918 (1.0701)  cardinality_error_4_unscaled: 3.0000 (4.7772)  time: 0.3336  data: 0.0146  max mem: 5637
Epoch: [0]  [110/225]  eta: 0:00:38  lr: 0.001000  class_error: 100.00  loss: 21.6342 (27.3816)  loss_ce: 0.7064 (0.7642)  loss_bbox: 0.9988 (1.6882)  loss_giou: 1.4457 (1.9938)  loss_ce_0: 0.7238 (0.7540)  loss_bbox_0: 1.2679 (1.7486)  loss_giou_0: 1.6559 (2.0487)  loss_ce_1: 0.6962 (0.7567)  loss_bbox_1: 1.1178 (1.7325)  loss_giou_1: 1.5664 (2.0472)  loss_ce_2: 0.7001 (0.7578)  loss_bbox_2: 1.2249 (1.8120)  loss_giou_2: 1.7109 (2.0992)  loss_ce_3: 0.7018 (0.7668)  loss_bbox_3: 1.1599 (1.7221)  loss_giou_3: 1.5990 (2.0310)  loss_ce_4: 0.7098 (0.7668)  loss_bbox_4: 1.1236 (1.7990)  loss_giou_4: 1.4905 (2.0932)  loss_ce_unscaled: 0.7064 (0.7642)  class_error_unscaled: 100.0000 (99.3994)  loss_bbox_unscaled: 0.1998 (0.3376)  loss_giou_unscaled: 0.7229 (0.9969)  cardinality_error_unscaled: 3.0000 (4.6374)  loss_ce_0_unscaled: 0.7238 (0.7540)  loss_bbox_0_unscaled: 0.2536 (0.3497)  loss_giou_0_unscaled: 0.8279 (1.0243)  cardinality_error_0_unscaled: 3.0000 (3.7838)  loss_ce_1_unscaled: 0.6962 (0.7567)  loss_bbox_1_unscaled: 0.2236 (0.3465)  loss_giou_1_unscaled: 0.7832 (1.0236)  cardinality_error_1_unscaled: 3.0000 (3.9054)  loss_ce_2_unscaled: 0.7001 (0.7578)  loss_bbox_2_unscaled: 0.2450 (0.3624)  loss_giou_2_unscaled: 0.8554 (1.0496)  cardinality_error_2_unscaled: 3.0000 (4.5631)  loss_ce_3_unscaled: 0.7018 (0.7668)  loss_bbox_3_unscaled: 0.2320 (0.3444)  loss_giou_3_unscaled: 0.7995 (1.0155)  cardinality_error_3_unscaled: 3.0000 (4.6374)  loss_ce_4_unscaled: 0.7098 (0.7668)  loss_bbox_4_unscaled: 0.2247 (0.3598)  loss_giou_4_unscaled: 0.7453 (1.0466)  cardinality_error_4_unscaled: 3.0000 (4.6351)  time: 0.3367  data: 0.0147  max mem: 5637
Epoch: [0]  [120/225]  eta: 0:00:35  lr: 0.001000  class_error: 100.00  loss: 21.1659 (26.8506)  loss_ce: 0.7089 (0.7577)  loss_bbox: 0.9988 (1.6532)  loss_giou: 1.4848 (1.9689)  loss_ce_0: 0.7133 (0.7496)  loss_bbox_0: 1.0947 (1.7004)  loss_giou_0: 1.5849 (2.0108)  loss_ce_1: 0.7111 (0.7519)  loss_bbox_1: 1.0310 (1.6767)  loss_giou_1: 1.4821 (1.9990)  loss_ce_2: 0.7007 (0.7512)  loss_bbox_2: 1.2147 (1.7573)  loss_giou_2: 1.5715 (2.0571)  loss_ce_3: 0.7257 (0.7607)  loss_bbox_3: 1.3461 (1.6945)  loss_giou_3: 1.7193 (2.0082)  loss_ce_4: 0.7094 (0.7603)  loss_bbox_4: 1.0727 (1.7453)  loss_giou_4: 1.4593 (2.0478)  loss_ce_unscaled: 0.7089 (0.7577)  class_error_unscaled: 100.0000 (99.4490)  loss_bbox_unscaled: 0.1998 (0.3306)  loss_giou_unscaled: 0.7424 (0.9845)  cardinality_error_unscaled: 3.2500 (4.5124)  loss_ce_0_unscaled: 0.7133 (0.7496)  loss_bbox_0_unscaled: 0.2189 (0.3401)  loss_giou_0_unscaled: 0.7924 (1.0054)  cardinality_error_0_unscaled: 3.2500 (3.7273)  loss_ce_1_unscaled: 0.7111 (0.7519)  loss_bbox_1_unscaled: 0.2062 (0.3353)  loss_giou_1_unscaled: 0.7411 (0.9995)  cardinality_error_1_unscaled: 3.2500 (3.8409)  loss_ce_2_unscaled: 0.7007 (0.7512)  loss_bbox_2_unscaled: 0.2429 (0.3515)  loss_giou_2_unscaled: 0.7857 (1.0286)  cardinality_error_2_unscaled: 3.2500 (4.4442)  loss_ce_3_unscaled: 0.7257 (0.7607)  loss_bbox_3_unscaled: 0.2692 (0.3389)  loss_giou_3_unscaled: 0.8596 (1.0041)  cardinality_error_3_unscaled: 3.2500 (4.5124)  loss_ce_4_unscaled: 0.7094 (0.7603)  loss_bbox_4_unscaled: 0.2145 (0.3491)  loss_giou_4_unscaled: 0.7297 (1.0239)  cardinality_error_4_unscaled: 3.2500 (4.5103)  time: 0.3324  data: 0.0146  max mem: 5637
Epoch: [0]  [130/225]  eta: 0:00:31  lr: 0.001000  class_error: 100.00  loss: 19.8474 (26.3022)  loss_ce: 0.6661 (0.7493)  loss_bbox: 1.1333 (1.6103)  loss_giou: 1.5970 (1.9413)  loss_ce_0: 0.6654 (0.7415)  loss_bbox_0: 1.0947 (1.6567)  loss_giou_0: 1.5989 (1.9873)  loss_ce_1: 0.6688 (0.7446)  loss_bbox_1: 1.0219 (1.6198)  loss_giou_1: 1.4521 (1.9542)  loss_ce_2: 0.6613 (0.7427)  loss_bbox_2: 1.0290 (1.6975)  loss_giou_2: 1.5071 (2.0150)  loss_ce_3: 0.6708 (0.7522)  loss_bbox_3: 1.0442 (1.6452)  loss_giou_3: 1.5472 (1.9718)  loss_ce_4: 0.6684 (0.7518)  loss_bbox_4: 1.0742 (1.7012)  loss_giou_4: 1.4883 (2.0199)  loss_ce_unscaled: 0.6661 (0.7493)  class_error_unscaled: 100.0000 (99.4911)  loss_bbox_unscaled: 0.2267 (0.3221)  loss_giou_unscaled: 0.7985 (0.9706)  cardinality_error_unscaled: 2.7500 (4.3664)  loss_ce_0_unscaled: 0.6654 (0.7415)  loss_bbox_0_unscaled: 0.2189 (0.3313)  loss_giou_0_unscaled: 0.7994 (0.9936)  cardinality_error_0_unscaled: 2.7500 (3.6412)  loss_ce_1_unscaled: 0.6688 (0.7446)  loss_bbox_1_unscaled: 0.2044 (0.3240)  loss_giou_1_unscaled: 0.7261 (0.9771)  cardinality_error_1_unscaled: 2.7500 (3.7462)  loss_ce_2_unscaled: 0.6613 (0.7427)  loss_bbox_2_unscaled: 0.2058 (0.3395)  loss_giou_2_unscaled: 0.7535 (1.0075)  cardinality_error_2_unscaled: 2.7500 (4.3034)  loss_ce_3_unscaled: 0.6708 (0.7522)  loss_bbox_3_unscaled: 0.2088 (0.3290)  loss_giou_3_unscaled: 0.7736 (0.9859)  cardinality_error_3_unscaled: 2.7500 (4.3664)  loss_ce_4_unscaled: 0.6684 (0.7518)  loss_bbox_4_unscaled: 0.2148 (0.3402)  loss_giou_4_unscaled: 0.7441 (1.0100)  cardinality_error_4_unscaled: 2.7500 (4.3645)  time: 0.3296  data: 0.0146  max mem: 5637
Epoch: [0]  [140/225]  eta: 0:00:28  lr: 0.001000  class_error: 100.00  loss: 19.8021 (25.8832)  loss_ce: 0.6661 (0.7441)  loss_bbox: 1.1773 (1.5948)  loss_giou: 1.6169 (1.9355)  loss_ce_0: 0.6654 (0.7385)  loss_bbox_0: 1.0968 (1.6210)  loss_giou_0: 1.6052 (1.9583)  loss_ce_1: 0.6688 (0.7396)  loss_bbox_1: 1.0471 (1.5973)  loss_giou_1: 1.5270 (1.9388)  loss_ce_2: 0.6613 (0.7373)  loss_bbox_2: 1.0705 (1.6604)  loss_giou_2: 1.5409 (1.9847)  loss_ce_3: 0.6688 (0.7472)  loss_bbox_3: 0.9242 (1.5927)  loss_giou_3: 1.4197 (1.9273)  loss_ce_4: 0.6690 (0.7476)  loss_bbox_4: 0.9728 (1.6452)  loss_giou_4: 1.4794 (1.9728)  loss_ce_unscaled: 0.6661 (0.7441)  class_error_unscaled: 100.0000 (99.5272)  loss_bbox_unscaled: 0.2355 (0.3190)  loss_giou_unscaled: 0.8085 (0.9678)  cardinality_error_unscaled: 2.7500 (4.2766)  loss_ce_0_unscaled: 0.6654 (0.7385)  loss_bbox_0_unscaled: 0.2194 (0.3242)  loss_giou_0_unscaled: 0.8026 (0.9792)  cardinality_error_0_unscaled: 2.7500 (3.6028)  loss_ce_1_unscaled: 0.6688 (0.7396)  loss_bbox_1_unscaled: 0.2094 (0.3195)  loss_giou_1_unscaled: 0.7635 (0.9694)  cardinality_error_1_unscaled: 2.7500 (3.7004)  loss_ce_2_unscaled: 0.6613 (0.7373)  loss_bbox_2_unscaled: 0.2141 (0.3321)  loss_giou_2_unscaled: 0.7705 (0.9923)  cardinality_error_2_unscaled: 2.7500 (4.2181)  loss_ce_3_unscaled: 0.6688 (0.7472)  loss_bbox_3_unscaled: 0.1848 (0.3185)  loss_giou_3_unscaled: 0.7098 (0.9637)  cardinality_error_3_unscaled: 2.7500 (4.2766)  loss_ce_4_unscaled: 0.6690 (0.7476)  loss_bbox_4_unscaled: 0.1946 (0.3290)  loss_giou_4_unscaled: 0.7397 (0.9864)  cardinality_error_4_unscaled: 2.7500 (4.2748)  time: 0.3280  data: 0.0146  max mem: 5637
Epoch: [0]  [150/225]  eta: 0:00:25  lr: 0.001000  class_error: 100.00  loss: 19.4377 (25.4237)  loss_ce: 0.6484 (0.7362)  loss_bbox: 1.1630 (1.5634)  loss_giou: 1.6169 (1.9105)  loss_ce_0: 0.6593 (0.7312)  loss_bbox_0: 1.0526 (1.5859)  loss_giou_0: 1.5327 (1.9289)  loss_ce_1: 0.6352 (0.7314)  loss_bbox_1: 1.0940 (1.5660)  loss_giou_1: 1.5832 (1.9111)  loss_ce_2: 0.6494 (0.7303)  loss_bbox_2: 1.0556 (1.6159)  loss_giou_2: 1.4842 (1.9477)  loss_ce_3: 0.6339 (0.7397)  loss_bbox_3: 0.8654 (1.5534)  loss_giou_3: 1.3514 (1.8963)  loss_ce_4: 0.6644 (0.7394)  loss_bbox_4: 0.9673 (1.6013)  loss_giou_4: 1.3806 (1.9351)  loss_ce_unscaled: 0.6484 (0.7362)  class_error_unscaled: 100.0000 (99.5585)  loss_bbox_unscaled: 0.2326 (0.3127)  loss_giou_unscaled: 0.8085 (0.9552)  cardinality_error_unscaled: 2.7500 (4.1556)  loss_ce_0_unscaled: 0.6593 (0.7312)  loss_bbox_0_unscaled: 0.2105 (0.3172)  loss_giou_0_unscaled: 0.7664 (0.9644)  cardinality_error_0_unscaled: 2.7500 (3.5265)  loss_ce_1_unscaled: 0.6352 (0.7314)  loss_bbox_1_unscaled: 0.2188 (0.3132)  loss_giou_1_unscaled: 0.7916 (0.9556)  cardinality_error_1_unscaled: 2.7500 (3.6175)  loss_ce_2_unscaled: 0.6494 (0.7303)  loss_bbox_2_unscaled: 0.2111 (0.3232)  loss_giou_2_unscaled: 0.7421 (0.9738)  cardinality_error_2_unscaled: 2.7500 (4.1010)  loss_ce_3_unscaled: 0.6339 (0.7397)  loss_bbox_3_unscaled: 0.1731 (0.3107)  loss_giou_3_unscaled: 0.6757 (0.9481)  cardinality_error_3_unscaled: 2.7500 (4.1556)  loss_ce_4_unscaled: 0.6644 (0.7394)  loss_bbox_4_unscaled: 0.1935 (0.3203)  loss_giou_4_unscaled: 0.6903 (0.9675)  cardinality_error_4_unscaled: 2.7500 (4.1540)  time: 0.3254  data: 0.0145  max mem: 5637
Epoch: [0]  [160/225]  eta: 0:00:21  lr: 0.001000  class_error: 100.00  loss: 18.5955 (24.9967)  loss_ce: 0.6175 (0.7305)  loss_bbox: 1.0154 (1.5311)  loss_giou: 1.4537 (1.8840)  loss_ce_0: 0.6441 (0.7263)  loss_bbox_0: 1.0167 (1.5470)  loss_giou_0: 1.4343 (1.8967)  loss_ce_1: 0.6136 (0.7256)  loss_bbox_1: 0.9467 (1.5288)  loss_giou_1: 1.4095 (1.8783)  loss_ce_2: 0.6178 (0.7251)  loss_bbox_2: 0.9770 (1.5762)  loss_giou_2: 1.4386 (1.9168)  loss_ce_3: 0.6290 (0.7335)  loss_bbox_3: 0.9467 (1.5208)  loss_giou_3: 1.4775 (1.8709)  loss_ce_4: 0.6210 (0.7343)  loss_bbox_4: 0.9768 (1.5656)  loss_giou_4: 1.3806 (1.9053)  loss_ce_unscaled: 0.6175 (0.7305)  class_error_unscaled: 100.0000 (99.5859)  loss_bbox_unscaled: 0.2031 (0.3062)  loss_giou_unscaled: 0.7269 (0.9420)  cardinality_error_unscaled: 2.5000 (4.0590)  loss_ce_0_unscaled: 0.6441 (0.7263)  loss_bbox_0_unscaled: 0.2033 (0.3094)  loss_giou_0_unscaled: 0.7172 (0.9484)  cardinality_error_0_unscaled: 2.5000 (3.4689)  loss_ce_1_unscaled: 0.6136 (0.7256)  loss_bbox_1_unscaled: 0.1893 (0.3058)  loss_giou_1_unscaled: 0.7048 (0.9391)  cardinality_error_1_unscaled: 2.5000 (3.5543)  loss_ce_2_unscaled: 0.6178 (0.7251)  loss_bbox_2_unscaled: 0.1954 (0.3152)  loss_giou_2_unscaled: 0.7193 (0.9584)  cardinality_error_2_unscaled: 2.5000 (4.0078)  loss_ce_3_unscaled: 0.6290 (0.7335)  loss_bbox_3_unscaled: 0.1893 (0.3042)  loss_giou_3_unscaled: 0.7387 (0.9354)  cardinality_error_3_unscaled: 2.5000 (4.0590)  loss_ce_4_unscaled: 0.6210 (0.7343)  loss_bbox_4_unscaled: 0.1954 (0.3131)  loss_giou_4_unscaled: 0.6903 (0.9526)  cardinality_error_4_unscaled: 2.5000 (4.0575)  time: 0.3258  data: 0.0145  max mem: 5637
Epoch: [0]  [170/225]  eta: 0:00:18  lr: 0.001000  class_error: 100.00  loss: 18.7155 (24.6388)  loss_ce: 0.6814 (0.7295)  loss_bbox: 0.9805 (1.4954)  loss_giou: 1.4094 (1.8550)  loss_ce_0: 0.6808 (0.7260)  loss_bbox_0: 0.9178 (1.5164)  loss_giou_0: 1.4553 (1.8778)  loss_ce_1: 0.6558 (0.7249)  loss_bbox_1: 0.9360 (1.4940)  loss_giou_1: 1.3779 (1.8533)  loss_ce_2: 0.6761 (0.7245)  loss_bbox_2: 0.9238 (1.5357)  loss_giou_2: 1.3853 (1.8836)  loss_ce_3: 0.6566 (0.7324)  loss_bbox_3: 0.9437 (1.4854)  loss_giou_3: 1.4492 (1.8438)  loss_ce_4: 0.6907 (0.7334)  loss_bbox_4: 0.9971 (1.5383)  loss_giou_4: 1.5085 (1.8894)  loss_ce_unscaled: 0.6814 (0.7295)  class_error_unscaled: 100.0000 (99.6101)  loss_bbox_unscaled: 0.1961 (0.2991)  loss_giou_unscaled: 0.7047 (0.9275)  cardinality_error_unscaled: 2.7500 (4.0073)  loss_ce_0_unscaled: 0.6808 (0.7260)  loss_bbox_0_unscaled: 0.1836 (0.3033)  loss_giou_0_unscaled: 0.7277 (0.9389)  cardinality_error_0_unscaled: 2.7500 (3.4488)  loss_ce_1_unscaled: 0.6558 (0.7249)  loss_bbox_1_unscaled: 0.1872 (0.2988)  loss_giou_1_unscaled: 0.6889 (0.9267)  cardinality_error_1_unscaled: 2.7500 (3.5322)  loss_ce_2_unscaled: 0.6761 (0.7245)  loss_bbox_2_unscaled: 0.1848 (0.3071)  loss_giou_2_unscaled: 0.6927 (0.9418)  cardinality_error_2_unscaled: 2.7500 (3.9576)  loss_ce_3_unscaled: 0.6566 (0.7324)  loss_bbox_3_unscaled: 0.1887 (0.2971)  loss_giou_3_unscaled: 0.7246 (0.9219)  cardinality_error_3_unscaled: 2.7500 (4.0073)  loss_ce_4_unscaled: 0.6907 (0.7334)  loss_bbox_4_unscaled: 0.1994 (0.3077)  loss_giou_4_unscaled: 0.7542 (0.9447)  cardinality_error_4_unscaled: 2.7500 (4.0015)  time: 0.3224  data: 0.0144  max mem: 5637
Epoch: [0]  [180/225]  eta: 0:00:14  lr: 0.001000  class_error: 100.00  loss: 18.4054 (24.3400)  loss_ce: 0.7040 (0.7282)  loss_bbox: 0.9830 (1.4731)  loss_giou: 1.3940 (1.8341)  loss_ce_0: 0.7026 (0.7243)  loss_bbox_0: 1.0940 (1.4929)  loss_giou_0: 1.4768 (1.8555)  loss_ce_1: 0.7101 (0.7233)  loss_bbox_1: 0.9232 (1.4649)  loss_giou_1: 1.4000 (1.8267)  loss_ce_2: 0.7053 (0.7232)  loss_bbox_2: 0.9237 (1.5075)  loss_giou_2: 1.3560 (1.8598)  loss_ce_3: 0.7139 (0.7305)  loss_bbox_3: 0.9044 (1.4594)  loss_giou_3: 1.3482 (1.8174)  loss_ce_4: 0.7048 (0.7320)  loss_bbox_4: 1.0585 (1.5171)  loss_giou_4: 1.5085 (1.8700)  loss_ce_unscaled: 0.7040 (0.7282)  class_error_unscaled: 100.0000 (99.6317)  loss_bbox_unscaled: 0.1966 (0.2946)  loss_giou_unscaled: 0.6970 (0.9170)  cardinality_error_unscaled: 3.2500 (3.9599)  loss_ce_0_unscaled: 0.7026 (0.7243)  loss_bbox_0_unscaled: 0.2188 (0.2986)  loss_giou_0_unscaled: 0.7384 (0.9278)  cardinality_error_0_unscaled: 3.2500 (3.4309)  loss_ce_1_unscaled: 0.7101 (0.7233)  loss_bbox_1_unscaled: 0.1846 (0.2930)  loss_giou_1_unscaled: 0.7000 (0.9134)  cardinality_error_1_unscaled: 3.2500 (3.5110)  loss_ce_2_unscaled: 0.7053 (0.7232)  loss_bbox_2_unscaled: 0.1847 (0.3015)  loss_giou_2_unscaled: 0.6780 (0.9299)  cardinality_error_2_unscaled: 3.2500 (3.9130)  loss_ce_3_unscaled: 0.7139 (0.7305)  loss_bbox_3_unscaled: 0.1809 (0.2919)  loss_giou_3_unscaled: 0.6741 (0.9087)  cardinality_error_3_unscaled: 3.2500 (3.9586)  loss_ce_4_unscaled: 0.7048 (0.7320)  loss_bbox_4_unscaled: 0.2117 (0.3034)  loss_giou_4_unscaled: 0.7542 (0.9350)  cardinality_error_4_unscaled: 3.2500 (3.9517)  time: 0.3212  data: 0.0148  max mem: 5637
Epoch: [0]  [190/225]  eta: 0:00:11  lr: 0.001000  class_error: 100.00  loss: 18.0909 (23.9890)  loss_ce: 0.7036 (0.7262)  loss_bbox: 0.9683 (1.4462)  loss_giou: 1.4349 (1.8160)  loss_ce_0: 0.7026 (0.7226)  loss_bbox_0: 0.9071 (1.4586)  loss_giou_0: 1.3566 (1.8274)  loss_ce_1: 0.6941 (0.7210)  loss_bbox_1: 0.8661 (1.4328)  loss_giou_1: 1.3329 (1.8017)  loss_ce_2: 0.7035 (0.7214)  loss_bbox_2: 0.8918 (1.4733)  loss_giou_2: 1.3383 (1.8350)  loss_ce_3: 0.6843 (0.7281)  loss_bbox_3: 0.8703 (1.4305)  loss_giou_3: 1.3343 (1.7969)  loss_ce_4: 0.7020 (0.7305)  loss_bbox_4: 0.9180 (1.4799)  loss_giou_4: 1.3109 (1.8409)  loss_ce_unscaled: 0.7036 (0.7262)  class_error_unscaled: 100.0000 (99.6510)  loss_bbox_unscaled: 0.1937 (0.2892)  loss_giou_unscaled: 0.7174 (0.9080)  cardinality_error_unscaled: 3.2500 (3.9110)  loss_ce_0_unscaled: 0.7026 (0.7226)  loss_bbox_0_unscaled: 0.1814 (0.2917)  loss_giou_0_unscaled: 0.6783 (0.9137)  cardinality_error_0_unscaled: 3.2500 (3.4097)  loss_ce_1_unscaled: 0.6941 (0.7210)  loss_bbox_1_unscaled: 0.1732 (0.2866)  loss_giou_1_unscaled: 0.6665 (0.9009)  cardinality_error_1_unscaled: 3.2500 (3.4856)  loss_ce_2_unscaled: 0.7035 (0.7214)  loss_bbox_2_unscaled: 0.1784 (0.2947)  loss_giou_2_unscaled: 0.6692 (0.9175)  cardinality_error_2_unscaled: 3.2500 (3.8665)  loss_ce_3_unscaled: 0.6843 (0.7281)  loss_bbox_3_unscaled: 0.1741 (0.2861)  loss_giou_3_unscaled: 0.6672 (0.8984)  cardinality_error_3_unscaled: 3.2500 (3.9097)  loss_ce_4_unscaled: 0.7020 (0.7305)  loss_bbox_4_unscaled: 0.1836 (0.2960)  loss_giou_4_unscaled: 0.6555 (0.9204)  cardinality_error_4_unscaled: 3.2500 (3.9031)  time: 0.3258  data: 0.0149  max mem: 5640
Epoch: [0]  [200/225]  eta: 0:00:08  lr: 0.001000  class_error: 100.00  loss: 17.8010 (23.7049)  loss_ce: 0.6675 (0.7230)  loss_bbox: 0.9316 (1.4328)  loss_giou: 1.4737 (1.8085)  loss_ce_0: 0.6540 (0.7194)  loss_bbox_0: 0.8879 (1.4402)  loss_giou_0: 1.3566 (1.8152)  loss_ce_1: 0.6669 (0.7185)  loss_bbox_1: 0.8515 (1.4078)  loss_giou_1: 1.3605 (1.7799)  loss_ce_2: 0.6703 (0.7191)  loss_bbox_2: 0.8717 (1.4446)  loss_giou_2: 1.3383 (1.8108)  loss_ce_3: 0.6720 (0.7254)  loss_bbox_3: 0.8708 (1.4028)  loss_giou_3: 1.3336 (1.7713)  loss_ce_4: 0.6676 (0.7272)  loss_bbox_4: 0.7845 (1.4471)  loss_giou_4: 1.2801 (1.8113)  loss_ce_unscaled: 0.6675 (0.7230)  class_error_unscaled: 100.0000 (99.6683)  loss_bbox_unscaled: 0.1863 (0.2866)  loss_giou_unscaled: 0.7369 (0.9042)  cardinality_error_unscaled: 2.7500 (3.8507)  loss_ce_0_unscaled: 0.6540 (0.7194)  loss_bbox_0_unscaled: 0.1776 (0.2880)  loss_giou_0_unscaled: 0.6783 (0.9076)  cardinality_error_0_unscaled: 2.7500 (3.3769)  loss_ce_1_unscaled: 0.6669 (0.7185)  loss_bbox_1_unscaled: 0.1703 (0.2816)  loss_giou_1_unscaled: 0.6803 (0.8900)  cardinality_error_1_unscaled: 2.7500 (3.4490)  loss_ce_2_unscaled: 0.6703 (0.7191)  loss_bbox_2_unscaled: 0.1743 (0.2889)  loss_giou_2_unscaled: 0.6692 (0.9054)  cardinality_error_2_unscaled: 2.7500 (3.8109)  loss_ce_3_unscaled: 0.6720 (0.7254)  loss_bbox_3_unscaled: 0.1742 (0.2806)  loss_giou_3_unscaled: 0.6668 (0.8857)  cardinality_error_3_unscaled: 2.7500 (3.8520)  loss_ce_4_unscaled: 0.6676 (0.7272)  loss_bbox_4_unscaled: 0.1569 (0.2894)  loss_giou_4_unscaled: 0.6400 (0.9056)  cardinality_error_4_unscaled: 2.7500 (3.8458)  time: 0.3301  data: 0.0145  max mem: 5640
Epoch: [0]  [210/225]  eta: 0:00:04  lr: 0.001000  class_error: 100.00  loss: 18.0540 (23.4763)  loss_ce: 0.6480 (0.7208)  loss_bbox: 1.0048 (1.4134)  loss_giou: 1.5002 (1.7948)  loss_ce_0: 0.6452 (0.7170)  loss_bbox_0: 1.0194 (1.4193)  loss_giou_0: 1.5324 (1.8009)  loss_ce_1: 0.6662 (0.7162)  loss_bbox_1: 0.9526 (1.3892)  loss_giou_1: 1.4151 (1.7668)  loss_ce_2: 0.6622 (0.7172)  loss_bbox_2: 0.9372 (1.4226)  loss_giou_2: 1.3879 (1.7945)  loss_ce_3: 0.6688 (0.7231)  loss_bbox_3: 0.8915 (1.3805)  loss_giou_3: 1.3420 (1.7544)  loss_ce_4: 0.6564 (0.7250)  loss_bbox_4: 0.8851 (1.4247)  loss_giou_4: 1.3233 (1.7959)  loss_ce_unscaled: 0.6480 (0.7208)  class_error_unscaled: 100.0000 (99.6840)  loss_bbox_unscaled: 0.2010 (0.2827)  loss_giou_unscaled: 0.7501 (0.8974)  cardinality_error_unscaled: 2.5000 (3.8021)  loss_ce_0_unscaled: 0.6452 (0.7170)  loss_bbox_0_unscaled: 0.2039 (0.2839)  loss_giou_0_unscaled: 0.7662 (0.9004)  cardinality_error_0_unscaled: 2.5000 (3.3507)  loss_ce_1_unscaled: 0.6662 (0.7162)  loss_bbox_1_unscaled: 0.1905 (0.2778)  loss_giou_1_unscaled: 0.7075 (0.8834)  cardinality_error_1_unscaled: 2.5000 (3.4194)  loss_ce_2_unscaled: 0.6622 (0.7172)  loss_bbox_2_unscaled: 0.1874 (0.2845)  loss_giou_2_unscaled: 0.6940 (0.8972)  cardinality_error_2_unscaled: 2.5000 (3.7642)  loss_ce_3_unscaled: 0.6688 (0.7231)  loss_bbox_3_unscaled: 0.1783 (0.2761)  loss_giou_3_unscaled: 0.6710 (0.8772)  cardinality_error_3_unscaled: 2.5000 (3.8033)  loss_ce_4_unscaled: 0.6564 (0.7250)  loss_bbox_4_unscaled: 0.1770 (0.2849)  loss_giou_4_unscaled: 0.6616 (0.8979)  cardinality_error_4_unscaled: 2.5000 (3.7974)  time: 0.3318  data: 0.0148  max mem: 5640
Epoch: [0]  [220/225]  eta: 0:00:01  lr: 0.001000  class_error: 100.00  loss: 18.4685 (23.2550)  loss_ce: 0.7104 (0.7212)  loss_bbox: 1.0054 (1.3937)  loss_giou: 1.4747 (1.7770)  loss_ce_0: 0.7039 (0.7182)  loss_bbox_0: 0.8646 (1.3952)  loss_giou_0: 1.3435 (1.7785)  loss_ce_1: 0.7096 (0.7172)  loss_bbox_1: 0.9061 (1.3666)  loss_giou_1: 1.4124 (1.7470)  loss_ce_2: 0.7033 (0.7182)  loss_bbox_2: 1.0047 (1.4031)  loss_giou_2: 1.3926 (1.7766)  loss_ce_3: 0.7101 (0.7243)  loss_bbox_3: 0.8724 (1.3616)  loss_giou_3: 1.3556 (1.7367)  loss_ce_4: 0.7159 (0.7252)  loss_bbox_4: 0.9905 (1.4106)  loss_giou_4: 1.4590 (1.7839)  loss_ce_unscaled: 0.7104 (0.7212)  class_error_unscaled: 100.0000 (99.6983)  loss_bbox_unscaled: 0.2011 (0.2787)  loss_giou_unscaled: 0.7374 (0.8885)  cardinality_error_unscaled: 3.2500 (3.7896)  loss_ce_0_unscaled: 0.7039 (0.7182)  loss_bbox_0_unscaled: 0.1729 (0.2790)  loss_giou_0_unscaled: 0.6718 (0.8893)  cardinality_error_0_unscaled: 3.0000 (3.3575)  loss_ce_1_unscaled: 0.7096 (0.7172)  loss_bbox_1_unscaled: 0.1812 (0.2733)  loss_giou_1_unscaled: 0.7062 (0.8735)  cardinality_error_1_unscaled: 3.2500 (3.4231)  loss_ce_2_unscaled: 0.7033 (0.7182)  loss_bbox_2_unscaled: 0.2009 (0.2806)  loss_giou_2_unscaled: 0.6963 (0.8883)  cardinality_error_2_unscaled: 3.2500 (3.7534)  loss_ce_3_unscaled: 0.7101 (0.7243)  loss_bbox_3_unscaled: 0.1745 (0.2723)  loss_giou_3_unscaled: 0.6778 (0.8684)  cardinality_error_3_unscaled: 3.2500 (3.7919)  loss_ce_4_unscaled: 0.7159 (0.7252)  loss_bbox_4_unscaled: 0.1981 (0.2821)  loss_giou_4_unscaled: 0.7295 (0.8919)  cardinality_error_4_unscaled: 3.2500 (3.7862)  time: 0.3244  data: 0.0147  max mem: 5640
Epoch: [0]  [224/225]  eta: 0:00:00  lr: 0.001000  class_error: 100.00  loss: 18.4685 (23.1879)  loss_ce: 0.6992 (0.7193)  loss_bbox: 1.0062 (1.3909)  loss_giou: 1.4650 (1.7726)  loss_ce_0: 0.7027 (0.7163)  loss_bbox_0: 0.8961 (1.3926)  loss_giou_0: 1.3753 (1.7748)  loss_ce_1: 0.7094 (0.7157)  loss_bbox_1: 0.9046 (1.3604)  loss_giou_1: 1.3026 (1.7393)  loss_ce_2: 0.6922 (0.7161)  loss_bbox_2: 1.0614 (1.4012)  loss_giou_2: 1.3909 (1.7740)  loss_ce_3: 0.7083 (0.7224)  loss_bbox_3: 0.9956 (1.3571)  loss_giou_3: 1.3279 (1.7304)  loss_ce_4: 0.7025 (0.7232)  loss_bbox_4: 1.0279 (1.4045)  loss_giou_4: 1.4262 (1.7770)  loss_ce_unscaled: 0.6992 (0.7193)  class_error_unscaled: 100.0000 (99.7037)  loss_bbox_unscaled: 0.2012 (0.2782)  loss_giou_unscaled: 0.7325 (0.8863)  cardinality_error_unscaled: 3.0000 (3.7622)  loss_ce_0_unscaled: 0.7027 (0.7163)  loss_bbox_0_unscaled: 0.1792 (0.2785)  loss_giou_0_unscaled: 0.6876 (0.8874)  cardinality_error_0_unscaled: 3.0000 (3.3389)  loss_ce_1_unscaled: 0.7094 (0.7157)  loss_bbox_1_unscaled: 0.1809 (0.2721)  loss_giou_1_unscaled: 0.6513 (0.8697)  cardinality_error_1_unscaled: 3.0000 (3.4033)  loss_ce_2_unscaled: 0.6922 (0.7161)  loss_bbox_2_unscaled: 0.2123 (0.2802)  loss_giou_2_unscaled: 0.6955 (0.8870)  cardinality_error_2_unscaled: 3.0000 (3.7278)  loss_ce_3_unscaled: 0.7083 (0.7224)  loss_bbox_3_unscaled: 0.1991 (0.2714)  loss_giou_3_unscaled: 0.6639 (0.8652)  cardinality_error_3_unscaled: 3.0000 (3.7656)  loss_ce_4_unscaled: 0.7025 (0.7232)  loss_bbox_4_unscaled: 0.2056 (0.2809)  loss_giou_4_unscaled: 0.7131 (0.8885)  cardinality_error_4_unscaled: 3.0000 (3.7600)  time: 0.3175  data: 0.0142  max mem: 5640
Epoch: [0] Total time: 0:01:14 (0.3314 s / it)
Averaged stats: lr: 0.001000  class_error: 100.00  loss: 18.4685 (23.1879)  loss_ce: 0.6992 (0.7193)  loss_bbox: 1.0062 (1.3909)  loss_giou: 1.4650 (1.7726)  loss_ce_0: 0.7027 (0.7163)  loss_bbox_0: 0.8961 (1.3926)  loss_giou_0: 1.3753 (1.7748)  loss_ce_1: 0.7094 (0.7157)  loss_bbox_1: 0.9046 (1.3604)  loss_giou_1: 1.3026 (1.7393)  loss_ce_2: 0.6922 (0.7161)  loss_bbox_2: 1.0614 (1.4012)  loss_giou_2: 1.3909 (1.7740)  loss_ce_3: 0.7083 (0.7224)  loss_bbox_3: 0.9956 (1.3571)  loss_giou_3: 1.3279 (1.7304)  loss_ce_4: 0.7025 (0.7232)  loss_bbox_4: 1.0279 (1.4045)  loss_giou_4: 1.4262 (1.7770)  loss_ce_unscaled: 0.6992 (0.7193)  class_error_unscaled: 100.0000 (99.7037)  loss_bbox_unscaled: 0.2012 (0.2782)  loss_giou_unscaled: 0.7325 (0.8863)  cardinality_error_unscaled: 3.0000 (3.7622)  loss_ce_0_unscaled: 0.7027 (0.7163)  loss_bbox_0_unscaled: 0.1792 (0.2785)  loss_giou_0_unscaled: 0.6876 (0.8874)  cardinality_error_0_unscaled: 3.0000 (3.3389)  loss_ce_1_unscaled: 0.7094 (0.7157)  loss_bbox_1_unscaled: 0.1809 (0.2721)  loss_giou_1_unscaled: 0.6513 (0.8697)  cardinality_error_1_unscaled: 3.0000 (3.4033)  loss_ce_2_unscaled: 0.6922 (0.7161)  loss_bbox_2_unscaled: 0.2123 (0.2802)  loss_giou_2_unscaled: 0.6955 (0.8870)  cardinality_error_2_unscaled: 3.0000 (3.7278)  loss_ce_3_unscaled: 0.7083 (0.7224)  loss_bbox_3_unscaled: 0.1991 (0.2714)  loss_giou_3_unscaled: 0.6639 (0.8652)  cardinality_error_3_unscaled: 3.0000 (3.7656)  loss_ce_4_unscaled: 0.7025 (0.7232)  loss_bbox_4_unscaled: 0.2056 (0.2809)  loss_giou_4_unscaled: 0.7131 (0.8885)  cardinality_error_4_unscaled: 3.0000 (3.7600)
Test:  [ 0/25]  eta: 0:00:11  class_error: 100.00  loss: 39.7441 (39.7441)  loss_ce: 0.8244 (0.8244)  loss_bbox: 2.4085 (2.4085)  loss_giou: 2.7545 (2.7545)  loss_ce_0: 0.8306 (0.8306)  loss_bbox_0: 3.6152 (3.6152)  loss_giou_0: 3.1071 (3.1071)  loss_ce_1: 0.8320 (0.8320)  loss_bbox_1: 2.8095 (2.8095)  loss_giou_1: 3.0116 (3.0116)  loss_ce_2: 0.8365 (0.8365)  loss_bbox_2: 3.6158 (3.6158)  loss_giou_2: 3.1490 (3.1490)  loss_ce_3: 0.8286 (0.8286)  loss_bbox_3: 2.3526 (2.3526)  loss_giou_3: 2.6735 (2.6735)  loss_ce_4: 0.8391 (0.8391)  loss_bbox_4: 2.5136 (2.5136)  loss_giou_4: 2.7420 (2.7420)  loss_ce_unscaled: 0.8244 (0.8244)  class_error_unscaled: 100.0000 (100.0000)  loss_bbox_unscaled: 0.4817 (0.4817)  loss_giou_unscaled: 1.3773 (1.3773)  cardinality_error_unscaled: 4.0000 (4.0000)  loss_ce_0_unscaled: 0.8306 (0.8306)  loss_bbox_0_unscaled: 0.7230 (0.7230)  loss_giou_0_unscaled: 1.5536 (1.5536)  cardinality_error_0_unscaled: 4.0000 (4.0000)  loss_ce_1_unscaled: 0.8320 (0.8320)  loss_bbox_1_unscaled: 0.5619 (0.5619)  loss_giou_1_unscaled: 1.5058 (1.5058)  cardinality_error_1_unscaled: 4.0000 (4.0000)  loss_ce_2_unscaled: 0.8365 (0.8365)  loss_bbox_2_unscaled: 0.7232 (0.7232)  loss_giou_2_unscaled: 1.5745 (1.5745)  cardinality_error_2_unscaled: 4.0000 (4.0000)  loss_ce_3_unscaled: 0.8286 (0.8286)  loss_bbox_3_unscaled: 0.4705 (0.4705)  loss_giou_3_unscaled: 1.3368 (1.3368)  cardinality_error_3_unscaled: 4.0000 (4.0000)  loss_ce_4_unscaled: 0.8391 (0.8391)  loss_bbox_4_unscaled: 0.5027 (0.5027)  loss_giou_4_unscaled: 1.3710 (1.3710)  cardinality_error_4_unscaled: 4.0000 (4.0000)  time: 0.4770  data: 0.2986  max mem: 5640
Test:  [10/25]  eta: 0:00:03  class_error: 100.00  loss: 41.7285 (42.7447)  loss_ce: 0.7112 (0.7057)  loss_bbox: 2.8246 (2.9384)  loss_giou: 2.9177 (2.9712)  loss_ce_0: 0.7185 (0.7087)  loss_bbox_0: 3.9275 (4.0503)  loss_giou_0: 3.3074 (3.3029)  loss_ce_1: 0.7018 (0.7094)  loss_bbox_1: 3.0020 (3.0818)  loss_giou_1: 3.0118 (3.0638)  loss_ce_2: 0.7163 (0.7117)  loss_bbox_2: 3.7587 (3.9167)  loss_giou_2: 3.3040 (3.2931)  loss_ce_3: 0.7120 (0.7075)  loss_bbox_3: 2.7522 (2.8755)  loss_giou_3: 2.8856 (2.9507)  loss_ce_4: 0.7222 (0.7131)  loss_bbox_4: 2.9204 (3.0646)  loss_giou_4: 2.9303 (2.9795)  loss_ce_unscaled: 0.7112 (0.7057)  class_error_unscaled: 100.0000 (100.0000)  loss_bbox_unscaled: 0.5649 (0.5877)  loss_giou_unscaled: 1.4588 (1.4856)  cardinality_error_unscaled: 3.0000 (3.0682)  loss_ce_0_unscaled: 0.7185 (0.7087)  loss_bbox_0_unscaled: 0.7855 (0.8101)  loss_giou_0_unscaled: 1.6537 (1.6514)  cardinality_error_0_unscaled: 3.0000 (3.0682)  loss_ce_1_unscaled: 0.7018 (0.7094)  loss_bbox_1_unscaled: 0.6004 (0.6164)  loss_giou_1_unscaled: 1.5059 (1.5319)  cardinality_error_1_unscaled: 3.0000 (3.0682)  loss_ce_2_unscaled: 0.7163 (0.7117)  loss_bbox_2_unscaled: 0.7517 (0.7833)  loss_giou_2_unscaled: 1.6520 (1.6465)  cardinality_error_2_unscaled: 3.0000 (3.0682)  loss_ce_3_unscaled: 0.7120 (0.7075)  loss_bbox_3_unscaled: 0.5504 (0.5751)  loss_giou_3_unscaled: 1.4428 (1.4753)  cardinality_error_3_unscaled: 3.0000 (3.0682)  loss_ce_4_unscaled: 0.7222 (0.7131)  loss_bbox_4_unscaled: 0.5841 (0.6129)  loss_giou_4_unscaled: 1.4651 (1.4898)  cardinality_error_4_unscaled: 3.0000 (3.0682)  time: 0.2017  data: 0.0414  max mem: 5640
Test:  [20/25]  eta: 0:00:00  class_error: 100.00  loss: 41.2935 (41.5291)  loss_ce: 0.6903 (0.7185)  loss_bbox: 2.7155 (2.8024)  loss_giou: 2.9177 (2.9212)  loss_ce_0: 0.6937 (0.7217)  loss_bbox_0: 3.8261 (3.8550)  loss_giou_0: 3.2795 (3.2636)  loss_ce_1: 0.7012 (0.7231)  loss_bbox_1: 2.8832 (2.9342)  loss_giou_1: 2.9980 (2.9970)  loss_ce_2: 0.6953 (0.7253)  loss_bbox_2: 3.6141 (3.7033)  loss_giou_2: 3.2122 (3.2223)  loss_ce_3: 0.6916 (0.7206)  loss_bbox_3: 2.6811 (2.7424)  loss_giou_3: 2.8825 (2.9045)  loss_ce_4: 0.6974 (0.7266)  loss_bbox_4: 2.8724 (2.9167)  loss_giou_4: 2.9303 (2.9306)  loss_ce_unscaled: 0.6903 (0.7185)  class_error_unscaled: 100.0000 (100.0000)  loss_bbox_unscaled: 0.5431 (0.5605)  loss_giou_unscaled: 1.4588 (1.4606)  cardinality_error_unscaled: 3.0000 (3.1548)  loss_ce_0_unscaled: 0.6937 (0.7217)  loss_bbox_0_unscaled: 0.7652 (0.7710)  loss_giou_0_unscaled: 1.6397 (1.6318)  cardinality_error_0_unscaled: 3.0000 (3.1548)  loss_ce_1_unscaled: 0.7012 (0.7231)  loss_bbox_1_unscaled: 0.5766 (0.5868)  loss_giou_1_unscaled: 1.4990 (1.4985)  cardinality_error_1_unscaled: 3.0000 (3.1548)  loss_ce_2_unscaled: 0.6953 (0.7253)  loss_bbox_2_unscaled: 0.7228 (0.7407)  loss_giou_2_unscaled: 1.6061 (1.6111)  cardinality_error_2_unscaled: 3.0000 (3.1548)  loss_ce_3_unscaled: 0.6916 (0.7206)  loss_bbox_3_unscaled: 0.5362 (0.5485)  loss_giou_3_unscaled: 1.4412 (1.4522)  cardinality_error_3_unscaled: 3.0000 (3.1548)  loss_ce_4_unscaled: 0.6974 (0.7266)  loss_bbox_4_unscaled: 0.5745 (0.5833)  loss_giou_4_unscaled: 1.4651 (1.4653)  cardinality_error_4_unscaled: 3.0000 (3.1548)  time: 0.1793  data: 0.0158  max mem: 5640
Test:  [24/25]  eta: 0:00:00  class_error: 100.00  loss: 41.2935 (41.5760)  loss_ce: 0.7405 (0.7177)  loss_bbox: 2.8130 (2.8165)  loss_giou: 2.9177 (2.9297)  loss_ce_0: 0.7413 (0.7207)  loss_bbox_0: 3.7636 (3.8434)  loss_giou_0: 3.2680 (3.2679)  loss_ce_1: 0.7550 (0.7228)  loss_bbox_1: 2.8832 (2.9293)  loss_giou_1: 3.0061 (3.0050)  loss_ce_2: 0.7492 (0.7245)  loss_bbox_2: 3.6117 (3.6901)  loss_giou_2: 3.2321 (3.2339)  loss_ce_3: 0.7438 (0.7199)  loss_bbox_3: 2.6811 (2.7477)  loss_giou_3: 2.8856 (2.9175)  loss_ce_4: 0.7476 (0.7257)  loss_bbox_4: 2.8724 (2.9254)  loss_giou_4: 2.9195 (2.9384)  loss_ce_unscaled: 0.7405 (0.7177)  class_error_unscaled: 100.0000 (100.0000)  loss_bbox_unscaled: 0.5626 (0.5633)  loss_giou_unscaled: 1.4588 (1.4649)  cardinality_error_unscaled: 3.0000 (3.1400)  loss_ce_0_unscaled: 0.7413 (0.7207)  loss_bbox_0_unscaled: 0.7527 (0.7687)  loss_giou_0_unscaled: 1.6340 (1.6339)  cardinality_error_0_unscaled: 3.0000 (3.1400)  loss_ce_1_unscaled: 0.7550 (0.7228)  loss_bbox_1_unscaled: 0.5766 (0.5859)  loss_giou_1_unscaled: 1.5031 (1.5025)  cardinality_error_1_unscaled: 3.0000 (3.1400)  loss_ce_2_unscaled: 0.7492 (0.7245)  loss_bbox_2_unscaled: 0.7223 (0.7380)  loss_giou_2_unscaled: 1.6160 (1.6169)  cardinality_error_2_unscaled: 3.0000 (3.1400)  loss_ce_3_unscaled: 0.7438 (0.7199)  loss_bbox_3_unscaled: 0.5362 (0.5495)  loss_giou_3_unscaled: 1.4428 (1.4588)  cardinality_error_3_unscaled: 3.0000 (3.1400)  loss_ce_4_unscaled: 0.7476 (0.7257)  loss_bbox_4_unscaled: 0.5745 (0.5851)  loss_giou_4_unscaled: 1.4597 (1.4692)  cardinality_error_4_unscaled: 3.0000 (3.1400)  time: 0.1772  data: 0.0158  max mem: 5640
Test: Total time: 0:00:04 (0.1923 s / it)
Averaged stats: class_error: 100.00  loss: 41.2935 (41.5760)  loss_ce: 0.7405 (0.7177)  loss_bbox: 2.8130 (2.8165)  loss_giou: 2.9177 (2.9297)  loss_ce_0: 0.7413 (0.7207)  loss_bbox_0: 3.7636 (3.8434)  loss_giou_0: 3.2680 (3.2679)  loss_ce_1: 0.7550 (0.7228)  loss_bbox_1: 2.8832 (2.9293)  loss_giou_1: 3.0061 (3.0050)  loss_ce_2: 0.7492 (0.7245)  loss_bbox_2: 3.6117 (3.6901)  loss_giou_2: 3.2321 (3.2339)  loss_ce_3: 0.7438 (0.7199)  loss_bbox_3: 2.6811 (2.7477)  loss_giou_3: 2.8856 (2.9175)  loss_ce_4: 0.7476 (0.7257)  loss_bbox_4: 2.8724 (2.9254)  loss_giou_4: 2.9195 (2.9384)  loss_ce_unscaled: 0.7405 (0.7177)  class_error_unscaled: 100.0000 (100.0000)  loss_bbox_unscaled: 0.5626 (0.5633)  loss_giou_unscaled: 1.4588 (1.4649)  cardinality_error_unscaled: 3.0000 (3.1400)  loss_ce_0_unscaled: 0.7413 (0.7207)  loss_bbox_0_unscaled: 0.7527 (0.7687)  loss_giou_0_unscaled: 1.6340 (1.6339)  cardinality_error_0_unscaled: 3.0000 (3.1400)  loss_ce_1_unscaled: 0.7550 (0.7228)  loss_bbox_1_unscaled: 0.5766 (0.5859)  loss_giou_1_unscaled: 1.5031 (1.5025)  cardinality_error_1_unscaled: 3.0000 (3.1400)  loss_ce_2_unscaled: 0.7492 (0.7245)  loss_bbox_2_unscaled: 0.7223 (0.7380)  loss_giou_2_unscaled: 1.6160 (1.6169)  cardinality_error_2_unscaled: 3.0000 (3.1400)  loss_ce_3_unscaled: 0.7438 (0.7199)  loss_bbox_3_unscaled: 0.5362 (0.5495)  loss_giou_3_unscaled: 1.4428 (1.4588)  cardinality_error_3_unscaled: 3.0000 (3.1400)  loss_ce_4_unscaled: 0.7476 (0.7257)  loss_bbox_4_unscaled: 0.5745 (0.5851)  loss_giou_4_unscaled: 1.4597 (1.4692)  cardinality_error_4_unscaled: 3.0000 (3.1400)
Accumulating evaluation results...
DONE (t=0.08s).
IoU metric: bbox
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.001
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.001
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.001
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.002

Could you explain this to me, please?
Which parameters do I need to change?
Thank you.
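
For reference, here is a minimal sketch (not taken from the repo) of the optimizer split that DETR's main.py uses by default: AdamW with lr=1e-4 for the transformer and prediction heads, lr_backbone=1e-5 for the ResNet, and weight_decay=1e-4. The run logged above used lr=0.001, ten times the default transformer rate, and a learning rate that far above the defaults is a common cause of stalled or diverging training (class_error stuck at 100, AP at 0).

import torch

# Default DETR learning-rate split (values match main.py's defaults).
# The run logged above used lr=0.001, i.e. 10x larger, which often destabilizes training.
model = torch.hub.load('facebookresearch/detr:main', 'detr_resnet50', pretrained=False)

param_dicts = [
    {"params": [p for n, p in model.named_parameters()
                if "backbone" not in n and p.requires_grad]},
    {"params": [p for n, p in model.named_parameters()
                if "backbone" in n and p.requires_grad],
     "lr": 1e-5},  # lower learning rate for the pretrained backbone
]
optimizer = torch.optim.AdamW(param_dicts, lr=1e-4, weight_decay=1e-4)

Also worth noting: fine-tuning from a released COCO checkpoint typically converges far faster than training from randomly initialized weights on a small dataset.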

Using the model for image-classification-only tasks

How can I leverage this architecture for image classification tasks? I tried using the example in the Colab notebook but had trouble with batch sizes. The example is intended for batch_size = 1 and gives errors when using a larger batch size. How can I overcome this? A sketch of one workaround follows below.
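
One way around the batch-size problem, sketched below under the assumption that all inputs are resized to a single fixed resolution: once every image shares the same spatial size, the tensors can be stacked into one [B, 3, H, W] batch, which the full hub model accepts directly (file names below are placeholders). For a classification-only head you could then, for example, pool the backbone features and attach your own linear classifier, but that part is not something the repo provides out of the box.

import torch
import torchvision.transforms as T
from PIL import Image

# Resize every image to one fixed size so differently-sized inputs can be stacked.
transform = T.Compose([
    T.Resize((800, 800)),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

paths = ["img1.jpg", "img2.jpg"]  # placeholder file names
batch = torch.stack([transform(Image.open(p).convert("RGB")) for p in paths])

model = torch.hub.load('facebookresearch/detr:main', 'detr_resnet50', pretrained=True).eval()
with torch.no_grad():
    outputs = model(batch)
# outputs['pred_logits'] has shape [B, 100, 92]; outputs['pred_boxes'] has shape [B, 100, 4]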

Integration with Detectron2

First of all, excellent work!

I understand that integrating with Detectron2 is probably not automatic, since the training procedure differs from the default Detectron2 pipeline. But are there any plans to integrate DETR into Detectron2?

Thank you!

ModuleNotFoundError: No module named 'util'

Hi and thanks for the code!
When I try to import detr, it gives:

from detr.models import detr

      7 from torch import nn
      8 
----> 9 from util import box_ops
     10 from util.misc import (NestedTensor, accuracy, get_world_size, interpolate,
     11                        is_dist_avail_and_initialized)

ModuleNotFoundError: No module named 'util'

or

from detr.engine import evaluate

     10 import torch
     11 
---> 12 import util.misc as utils
     13 from datasets.coco_eval import CocoEvaluator
     14 from datasets.panoptic_eval import PanopticEvaluator

ModuleNotFoundError: No module named 'util'
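
A minimal workaround sketch, assuming the repository was cloned into ./detr (the path is an assumption): the modules inside the repo import each other with top-level names such as util and datasets, so the repository root itself needs to be on sys.path and the detr. package prefix dropped.

import sys

sys.path.append("./detr")  # path to the cloned repository root (adjust as needed)

# With the repo root on sys.path, the top-level imports inside the code resolve.
from models import detr      # instead of `from detr.models import detr`
from engine import evaluate  # instead of `from detr.engine import evaluate`

Alternatively, running scripts from inside the repository root avoids the problem entirely.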

Optimize the transformers

@alexholdenmiller @leaderj1001 @alcinos @snyxan
Thank you for your hard work,

Seeing the transformers learn to understand instances was truly amazing.
Further research into optimization is vital in order to make training and inference feasible for the average person.

Are there any plans for optimizing DETR (pruning, distillation, searching for better student models, etc.)?

https://github.com/mit-han-lab/hardware-aware-transformers
https://github.com/mit-han-lab/gan-compression
http://news.mit.edu/2020/foolproof-way-shrink-deep-learning-models-0430
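
Nothing like this ships with the repo, but as a rough illustration of one direction mentioned above, PyTorch's built-in pruning utilities can be applied to DETR's transformer weights directly; the 30% sparsity below is an arbitrary choice, and unstructured pruning only zeroes weights rather than reducing compute by itself.

import torch
import torch.nn.utils.prune as prune

model = torch.hub.load('facebookresearch/detr:main', 'detr_resnet50', pretrained=True)

# L1 unstructured pruning of every linear layer in the transformer.
for module in model.transformer.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # bake the mask in; pruned weights are simply zeroed

Measuring the AP drop on val5k after pruning would be the natural next step before attempting distillation or structured sparsity.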
