ruotianluo / pytorch-faster-rcnn

pytorch1.0 updated. Support cpu test and demo. (Use detectron2, it's a masterpiece)

License: MIT License

Shell 0.14% Roff 0.02% MATLAB 0.02% Python 2.57% Jupyter Notebook 97.25%

pytorch-faster-rcnn's Introduction

Notice(2019.11.2)

This repo was built two years ago, when no pytorch detection implementation could achieve reasonable performance. There are now many better repos out there, for example:

detectron2(It's a masterpiece.)
mmdetection

Therefore, this repo will not be actively maintained.

Important notice:

If you used the master branch before Sep. 26 2017 and its corresponding pretrained model, PLEASE PAY ATTENTION: The old master branch is now under old_master. You can still run the code and download the pretrained model, but the pretrained model for the old master is not compatible with the current master!

The main differences between the new and old master branches are in these two commits: 9d4c24e, c899ce7. The change is related to this issue; master now matches all the details in tf-faster-rcnn, so a pretrained tf model can now be converted to a pytorch model.

pytorch-faster-rcnn

A pytorch implementation of faster RCNN detection framework based on Xinlei Chen's tf-faster-rcnn. Xinlei Chen's repository is based on the python Caffe implementation of faster RCNN available here.

Note: Several minor modifications are made when reimplementing the framework, which give potential improvements. For details about the modifications and ablative analysis, please refer to the technical report An Implementation of Faster RCNN with Study for Region Sampling. If you are seeking to reproduce the results in the original paper, please use the official code or maybe the semi-official code. For details about the faster RCNN architecture please refer to the paper Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.

Detection Performance

The current code supports VGG16, Resnet V1 and Mobilenet V1 models. We mainly tested it on plain VGG16 and Resnet101 architecture. As the baseline, we report numbers using a single model on a single convolution layer, so no multi-scale, no multi-stage bounding box regression, no skip-connection, no extra input is used. The only data augmentation technique is left-right flipping during training following the original Faster RCNN. All models are released.

With VGG16 (conv5_3):

Train on VOC 2007 trainval and test on VOC 2007 test, 71.22(from scratch) 70.75(converted) (70.8 for tf-faster-rcnn).
Train on VOC 2007+2012 trainval and test on VOC 2007 test (R-FCN schedule), 75.33(from scratch) 75.27(converted) (75.7 for tf-faster-rcnn).
Train on COCO 2014 trainval35k and test on minival (900k/1190k) 29.2(from scratch) 30.1(converted) (30.2 for tf-faster-rcnn).

With Resnet101 (last conv4):

Train on VOC 2007 trainval and test on VOC 2007 test, 75.29(from scratch) 75.76(converted) (75.7 for tf-faster-rcnn).
Train on VOC 2007+2012 trainval and test on VOC 2007 test (R-FCN schedule), 79.26(from scratch) 79.78(converted) (79.8 for tf-faster-rcnn).
Train on COCO 2014 trainval35k and test on minival (800k/1190k), 35.1(from scratch) 35.4(converted) (35.4 for tf-faster-rcnn).

More Results:

Train Mobilenet (1.0, 224) on COCO 2014 trainval35k and test on minival (900k/1190k), 21.4(from scratch), 21.9(converted) (21.8 for tf-faster-rcnn).
Train Resnet50 on COCO 2014 trainval35k and test on minival (900k/1190k), 32.4(converted) (32.4 for tf-faster-rcnn).
Train Resnet152 on COCO 2014 trainval35k and test on minival (900k/1190k), 36.7(converted) (36.1 for tf-faster-rcnn).

Approximate baseline setup from FPN (this repository does not contain training code for FPN yet):

Train Resnet50 on COCO 2014 trainval35k and test on minival (900k/1190k), ~~34.2~~.
Train Resnet101 on COCO 2014 trainval35k and test on minival (900k/1190k), ~~37.4~~.
Train Resnet152 on COCO 2014 trainval35k and test on minival (900k/1190k), ~~38.2~~.

Note:

Due to the randomness in GPU training especially for VOC, the best numbers are reported (with 2-3 attempts) here. According to Xinlei's experience, for COCO you can almost always get a very close number (within ~0.2%) despite the randomness.
The numbers are obtained with the default testing scheme which selects region proposals using non-maximal suppression (TEST.MODE nms), the alternative testing scheme (TEST.MODE top) will likely result in slightly better performance (see report, for COCO it boosts 0.X AP).
Since we keep the small proposals (< 16 pixels width/height), our performance is especially good for small objects.
We do not apply a score threshold (rather than using 0.05) when deciding whether a detection is included in the final result, which increases recall.
Weight decay is set to 1e-4.
For other minor modifications, please check the report. Notable ones include using crop_and_resize, and excluding ground truth boxes in RoIs during training.
For COCO, we find the performance improving with more iterations, and potentially better performance can be achieved with even more iterations.
For Resnets, we fix the first block (total 4) when fine-tuning the network, and only use crop_and_resize to resize the RoIs (7x7) without max-pool (which Xinlei finds useless especially for COCO). The final feature maps are average-pooled for classification and regression. All batch normalization parameters are fixed. Learning rate for biases is not doubled.
For Mobilenets, we fix the first five layers when fine-tuning the network. All batch normalization parameters are fixed. Weight decay for Mobilenet layers is set to 4e-5.
For approximate FPN baseline setup we simply resize the image with 800 pixels, add 32^2 anchors, and take 1000 proposals during testing.
Check out here/here/here for the latest models, including longer COCO VGG16 models and Resnet ones.
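The default testing scheme in the notes above selects region proposals with non-maximal suppression. As a rough illustration of what that step does, here is a minimal pure-Python sketch of greedy NMS. It is not the repository's implementation (which relies on torchvision.ops); the function names, the continuous box-area convention and the 0.5 threshold are assumptions chosen for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold):
    """Return the indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Illustrative data: the second box overlaps the first heavily, the third is far away.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores, iou_threshold=0.5)
```

Here the second box is suppressed by the first, while the distant third box survives.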


Displayed Ground Truth on Tensorboard	Displayed Predictions on Tensorboard

Additional features

Additional features not mentioned in the report are added to make research life easier:

Support for train-and-validation. During training, the validation data will also be tested from time to time to monitor the process and check for potential overfitting. Ideally training and validation should be separate, where the model is loaded every time to test on validation. However, Xinlei has implemented it in a joint way to save time and GPU memory. Though in the default setup the testing data is used for validation, no special attempt is made to overfit on the testing set.
Support for resuming training. Xinlei tried to store as much information as possible when snapshotting, so that training can resume properly from the latest snapshot. The meta information includes the current image index, the permutation of images, and the random state of numpy. However, when you resume training the random seed for tensorflow will be reset (not sure how to save the random state of tensorflow now), so it will result in a difference. Note that the current implementation still cannot force the model to behave deterministically even with the random seeds set. Suggestions/solutions are welcome and much appreciated.
Support for visualization. The current implementation will summarize ground truth boxes, statistics of losses, activations and variables during training, and dump it to a separate folder for tensorboard visualization. The computing graph is also saved for debugging.
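To make the resume-training bookkeeping described above more concrete, here is a minimal sketch of saving and restoring the image index, the permutation and a random generator state. It uses Python's built-in random module as a stand-in for numpy's random state, and the function and key names are hypothetical rather than the ones used in this repository.

```python
import pickle
import random


def make_snapshot_meta(cur_index, perm):
    """Bundle the bookkeeping needed to resume data iteration."""
    return pickle.dumps({
        "cur": cur_index,
        "perm": list(perm),
        "rng_state": random.getstate(),
    })


def restore_snapshot_meta(blob):
    """Restore the generator state and return (cur_index, perm)."""
    meta = pickle.loads(blob)
    random.setstate(meta["rng_state"])
    return meta["cur"], meta["perm"]


random.seed(3)
perm = list(range(8))
random.shuffle(perm)

blob = make_snapshot_meta(cur_index=5, perm=perm)
expected_next = random.random()   # what an uninterrupted run would draw next
random.random()                   # let the generator drift past the snapshot

cur, restored_perm = restore_snapshot_meta(blob)
resumed_next = random.random()    # matches expected_next after restoring
```

Only the state captured in the snapshot is reproduced; any other source of randomness (for example GPU kernels) is outside the scope of this sketch.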

Prerequisites

A basic pytorch installation. The code follows 1.0. If you are using the older 0.1.12, 0.2, 0.3 or 0.4, you can check out the corresponding branch.
Torchvision 0.3. This code uses torchvision.ops for nms, roi_pool and roi_align.
Python packages you might not have: opencv-python, easydict (similar to py-faster-rcnn). For easydict make sure you have the right version. Xinlei uses 1.6.
tensorboard-pytorch to visualize the training and validation curve. Please build from source to use the latest tensorflow-tensorboard.
Docker users: Since the recent upgrade, the docker image on docker hub (https://hub.docker.com/r/mbuckler/tf-faster-rcnn-deps/) is no longer valid. However, you can still build your own image using the dockerfile located in the docker folder (cuda 8 version, as it is required by Tensorflow r1.0). Also make sure to follow the Tensorflow installation instructions to install and use nvidia-docker[https://github.com/NVIDIA/nvidia-docker]. Finally, after launching the container, you have to build the Cython modules within the running container.

Installation

Clone the repository

git clone https://github.com/ruotianluo/pytorch-faster-rcnn.git

Install the Python COCO API. The code requires the API to access COCO dataset.

cd data
git clone https://github.com/pdollar/coco.git
cd coco/PythonAPI
make
cd ../../..

Setup data

Please follow the instructions of py-faster-rcnn here to setup VOC and COCO datasets (Part of COCO is done). The steps involve downloading data and optionally creating soft links in the data folder. Since faster RCNN does not rely on pre-computed proposals, it is safe to ignore the steps that setup proposals.

If you find it useful, the data/cache folder created on Xinlei's side is also shared here.

Demo and Test with pre-trained models

Download pre-trained model (only google drive works)

~~Another server here.~~
Google drive here.

(Optional) Instead of downloading my pretrained or converted model, you can also convert from tf-faster-rcnn model. You can download the tensorflow pretrained model from tf-faster-rcnn. Then run:

python tools/convert_from_tensorflow.py --tensorflow_model resnet_model.ckpt 
python tools/convert_from_tensorflow_vgg.py --tensorflow_model vgg_model.ckpt

This script will create a .pth file with the same name in the same folder as the tensorflow model.

Create a folder and a soft link to use the pre-trained model

NET=res101
TRAIN_IMDB=voc_2007_trainval+voc_2012_trainval
mkdir -p output/${NET}/${TRAIN_IMDB}
cd output/${NET}/${TRAIN_IMDB}
ln -s ../../../data/voc_2007_trainval+voc_2012_trainval ./default
cd ../../..

Demo for testing on custom images

# at repository root
GPU_ID=0
CUDA_VISIBLE_DEVICES=${GPU_ID} ./tools/demo.py

Note: Resnet101 testing probably requires several gigabytes of memory, so if you encounter memory capacity issues, please install it with CPU support only. Refer to Issue 25.

Test with pre-trained Resnet101 models

GPU_ID=0
./experiments/scripts/test_faster_rcnn.sh $GPU_ID pascal_voc_0712 res101

Note: If you cannot get the reported numbers (79.8 on my side), then probably the NMS function is compiled improperly, refer to Issue 5.

Train your own model

Download pre-trained models and weights. The current code supports VGG16 and Resnet V1 models. Pre-trained models are provided by pytorch-vgg and pytorch-resnet (the ones with caffe in the name); you can download the pre-trained models and place them in the data/imagenet_weights folder. For example, for the VGG16 model, you can set it up like:

mkdir -p data/imagenet_weights
cd data/imagenet_weights
python # open python in terminal and run the following Python code

import torch
from torch.utils.model_zoo import load_url
from torchvision import models

sd = load_url("https://s3-us-west-2.amazonaws.com/jcjohns-models/vgg16-00b39a1b.pth")
sd['classifier.0.weight'] = sd['classifier.1.weight']
sd['classifier.0.bias'] = sd['classifier.1.bias']
del sd['classifier.1.weight']
del sd['classifier.1.bias']

sd['classifier.3.weight'] = sd['classifier.4.weight']
sd['classifier.3.bias'] = sd['classifier.4.bias']
del sd['classifier.4.weight']
del sd['classifier.4.bias']

torch.save(sd, "vgg16.pth")

cd ../..
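The key renaming in the VGG16 snippet above can also be written as one small helper. This is only a sketch: rename_keys is a hypothetical name, and a plain dict with placeholder string values stands in for the real state dict of tensors.

```python
def rename_keys(state_dict, mapping):
    """Return a copy of state_dict with keys renamed according to mapping."""
    return {mapping.get(k, k): v for k, v in state_dict.items()}


# Placeholder values; a real state dict maps these names to tensors.
sd = {
    "classifier.1.weight": "w1", "classifier.1.bias": "b1",
    "classifier.4.weight": "w4", "classifier.4.bias": "b4",
    "features.0.weight": "f0",
}

# Same renames as in the snippet above: classifier.1 -> 0 and classifier.4 -> 3.
mapping = {
    "classifier.1.weight": "classifier.0.weight",
    "classifier.1.bias": "classifier.0.bias",
    "classifier.4.weight": "classifier.3.weight",
    "classifier.4.bias": "classifier.3.bias",
}

renamed = rename_keys(sd, mapping)
```

Building a new dict avoids modifying the original while iterating over it; keys that are not in the mapping are copied unchanged.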

For Resnet101, you can set up like:

mkdir -p data/imagenet_weights
cd data/imagenet_weights
# download from my gdrive (link in pytorch-resnet)
mv resnet101-caffe.pth res101.pth
cd ../..

For Mobilenet V1, you can set up like:

mkdir -p data/imagenet_weights
cd data/imagenet_weights
# download from my gdrive (https://drive.google.com/open?id=0B7fNdx_jAqhtZGJvZlpVeDhUN1k)
mv mobilenet_v1_1.0_224.pth.pth mobile.pth
cd ../..

Train (and test, evaluation)

./experiments/scripts/train_faster_rcnn.sh [GPU_ID] [DATASET] [NET]
# GPU_ID is the GPU you want to train on
# NET in {vgg16, res50, res101, res152} is the network arch to use
# DATASET {pascal_voc, pascal_voc_0712, coco} is defined in train_faster_rcnn.sh
# Examples:
./experiments/scripts/train_faster_rcnn.sh 0 pascal_voc vgg16
./experiments/scripts/train_faster_rcnn.sh 1 coco res101

Note: Please double check you have deleted soft link to the pre-trained models before training. If you find NaNs during training, please refer to Issue 86. Also if you want to have multi-gpu support, check out Issue 121.

Visualization with Tensorboard

tensorboard --logdir=tensorboard/vgg16/voc_2007_trainval/ --port=7001 &
tensorboard --logdir=tensorboard/vgg16/coco_2014_train+coco_2014_valminusminival/ --port=7002 &

Test and evaluate

./experiments/scripts/test_faster_rcnn.sh [GPU_ID] [DATASET] [NET]
# GPU_ID is the GPU you want to test on
# NET in {vgg16, res50, res101, res152} is the network arch to use
# DATASET {pascal_voc, pascal_voc_0712, coco} is defined in test_faster_rcnn.sh
# Examples:
./experiments/scripts/test_faster_rcnn.sh 0 pascal_voc vgg16
./experiments/scripts/test_faster_rcnn.sh 1 coco res101

You can use tools/reval.sh for re-evaluation

By default, trained networks are saved under:

output/[NET]/[DATASET]/default/

Test outputs are saved under:

output/[NET]/[DATASET]/default/[SNAPSHOT]/

Tensorboard information for train and validation is saved under:

tensorboard/[NET]/[DATASET]/default/
tensorboard/[NET]/[DATASET]/default_val/

The default number of training iterations is kept the same as in the original faster RCNN for VOC 2007; however, Xinlei finds it beneficial to train longer (see the report for COCO), probably because the image batch size is one. For VOC 07+12 we switch to an 80k/110k schedule following R-FCN. Also note that due to the nondeterministic nature of the current implementation, the performance can vary a bit, but in general it should be within ~1% of the reported numbers for VOC, and ~0.2% of the reported numbers for COCO. Suggestions/contributions are welcome.
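As an illustration of the 80k/110k notation, the sketch below reads it as a single learning-rate step at 80k iterations within a 110k-iteration run. The base learning rate of 0.001 and the decay factor of 0.1 are assumptions made for the example, not values confirmed here; check the training scripts for the actual settings.

```python
def learning_rate(iteration, base_lr=0.001, step_at=80_000, gamma=0.1):
    """Step schedule: base_lr before step_at, base_lr * gamma from then on."""
    return base_lr * gamma if iteration >= step_at else base_lr


MAX_ITERS = 110_000

# Sample the schedule at the start, around the step, and at the last iteration.
samples = {it: learning_rate(it) for it in (0, 79_999, 80_000, MAX_ITERS - 1)}
```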

Citation

If you find this implementation or the analysis conducted in our report helpful, please consider citing:

@article{chen17implementation,
    Author = {Xinlei Chen and Abhinav Gupta},
    Title = {An Implementation of Faster RCNN with Study for Region Sampling},
    Journal = {arXiv preprint arXiv:1702.02138},
    Year = {2017}
}

For convenience, here is the faster RCNN citation:

@inproceedings{renNIPS15fasterrcnn,
    Author = {Shaoqing Ren and Kaiming He and Ross Girshick and Jian Sun},
    Title = {Faster {R-CNN}: Towards Real-Time Object Detection
             with Region Proposal Networks},
    Booktitle = {Advances in Neural Information Processing Systems ({NIPS})},
    Year = {2015}
}

Detailed numbers from COCO server (not supported)

All the models are trained on COCO 2014 trainval35k.

VGG16 COCO 2015 test-dev (900k/1190k):

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.297
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.504
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.312
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.128
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.325
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.421
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.272
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.399
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.409
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.187
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.451
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.591

VGG16 COCO 2015 test-std (900k/1190k):

Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.295
Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.501
Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.312
Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.119
Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.327
Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.418
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.273
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.400
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.409
Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.179
Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.455
Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.586

pytorch-faster-rcnn's Issues

libprotobuf error causes crash in the middle of training

I'm running into a strange error. During training (after several thousand iterations), my training script crashes with the error

libprotobuf FATAL google/protobuf/wire_format.cc:830] CHECK failed: (output->ByteCount()) == (expected_endpoint): : Protocol message serialized to a size different from what was originally expected.  Perhaps it was modified by another thread during serialization?
terminate called after throwing an instance of 'google::protobuf::FatalException'
  what():  CHECK failed: (output->ByteCount()) == (expected_endpoint): : Protocol message serialized to a size different from what was originally expected.  Perhaps it was modified by another thread during serialization?
Command terminated by signal 6

It doesn't look like I'm running out of memory on my gpu. Could this be a result of writing too much information to tensorboard?

tensorflow-converted model failed in demo.py with 'KeyError'

I first converted the tf-faster-rcnn models (http://xinlei.sp.cs.cmu.edu/xinleic/tf-faster-rcnn/res101/voc_2007_50-70k.tgz and http://xinlei.sp.cs.cmu.edu/xinleic/tf-faster-rcnn/vgg16/voc_2007_50k-70k.tgz) and then ran tools/demo.py, but it failed with the errors below (ubuntu14.04, cuda8.0, cudnn6, python3.6):

res101+pascal_voc

Traceback (most recent call last):
  File "tools/demo.py", line 143, in <module>
    net.load_state_dict(torch.load(saved_model))
  File "/home/ljy/pytorch-examples-master/pytorch-faster-rcnn-master/tools/../lib/nets/network.py", line 481, in load_state_dict
    nn.Module.load_state_dict(self, {k: state_dict[k] for k in list(self.state_dict())})
  File "/home/ljy/pytorch-examples-master/pytorch-faster-rcnn-master/tools/../lib/nets/network.py", line 481, in <dictcomp>
    nn.Module.load_state_dict(self, {k: state_dict[k] for k in list(self.state_dict())})
KeyError: 'resnet.bn1.bias'

vgg16+pascal_voc

Traceback (most recent call last):
  File "tools/demo.py", line 143, in <module>
    net.load_state_dict(torch.load(saved_model))
  File "/home/ljy/pytorch-examples-master/pytorch-faster-rcnn-master/tools/../lib/nets/network.py", line 481, in load_state_dict
    nn.Module.load_state_dict(self, {k: state_dict[k] for k in list(self.state_dict())})
  File "/home/ljy/pytorch-examples-master/pytorch-faster-rcnn-master/tools/../lib/nets/network.py", line 481, in <dictcomp>
    nn.Module.load_state_dict(self, {k: state_dict[k] for k in list(self.state_dict())})
KeyError: 'vgg.features.14.bias'

Something wrong with tensorflow-converted model?

can not find res101_faster_rcnn_iter_110000.pth

Hi, thank you for providing this pytorch implementation.
I cloned your code and found that the file res101_faster_rcnn_iter_110000.pth is not provided.
I visited the google drive url that you provide and downloaded the res50 folder (res50 is smaller than res101, so I chose it for a quicker download), and the file is named voc_2007_50k-70k, without any extension.

Is this the .pth model file? Thanks.

crop vs roi pool

@ruotianluo, I tried training the network with the crop layer and with the roi pool layer separately, but the results with the roi pool layer are far worse than with the crop layer. Have you run into this problem?

training issue

Sorry for disturbing~
I wish to know how can I train on VOC 2007+2012 dataset?
Thx for your reply!

The difference of roi_pooling and crop_roi_pooling

When reading your code, I found a layer named crop roi pooling that I had never heard of before. Could you explain the difference between roi_pooling and crop roi_pooling, or point me to some references about them? Thx

convert_from_tensorflow.py error

I used ./data/scripts/fetch_faster_rcnn_models.sh
to download tf-faster-rcnn's voc_2007_trainval+voc_2012_trainval
and ran:
python tools/convert_from_tensorflow.py --tensorflow_model data/voc_2007_trainval+voc_2012_trainval/res101_faster_rcnn_iter_110000.ckpt
which generates the following error:

File "tools/convert_from_tensorflow.py", line 21, in <module>
    for k in var_dict.keys():
RuntimeError: dictionary changed size during iteration

I fixed this by changing the loop

for k in var_dict.keys():
    if 'Momentum' in k:
        del var_dict[k]

to

for k in list(var_dict.keys()):
    if 'Momentum' in k:
        del var_dict[k]

and the loop

for k in var_dict.keys():
    if k.find('/') >= 0:
        var_dict['resnet' + k[k.find('/'):]] = var_dict[k]
        del var_dict[k]

to

for k in list(var_dict.keys()):
    if k.find('/') >= 0:
        var_dict['resnet' + k[k.find('/'):]] = var_dict[k]
        del var_dict[k]

It seems to work now.
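The fix reported in this issue works because list(var_dict.keys()) takes a snapshot of the keys before the dictionary is modified. Below is a self-contained sketch of the same pattern on a toy dictionary; the function name and the sample keys are made up for illustration and are not taken from the conversion script.

```python
def drop_momentum_and_rename(var_dict, prefix="resnet"):
    """Remove optimizer slots and rewrite the scope prefix, in place."""
    # Snapshot the keys first so that deleting entries is safe.
    for k in list(var_dict.keys()):
        if "Momentum" in k:
            del var_dict[k]
    # Snapshot again, since this loop both adds and removes keys.
    for k in list(var_dict.keys()):
        slash = k.find("/")
        if slash >= 0:
            # pop() runs before the assignment, so this stays correct
            # even if the new key equals the old one.
            var_dict[prefix + k[slash:]] = var_dict.pop(k)
    return var_dict


# Hypothetical variable names in the style of a TensorFlow checkpoint.
variables = {
    "resnet_v1_101/conv1/weights": 1,
    "resnet_v1_101/conv1/weights/Momentum": 2,
    "global_step": 3,
}
converted = drop_momentum_and_rename(variables)
```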

No file name setup.py in lib/

There is no file named setup.py in lib directory. Is this not required?

undefined symbol: state

When I try to run the demo, I get the following error:
pattern@pattern-58:~/workspace/jiyy/pytorch-faster-rcnn$ CUDA_VISIBLE_DEVICES=0 ./tools/demo.py
Traceback (most recent call last):
  File "./tools/demo.py", line 22, in <module>
    from model.test import im_detect
  File "/home/pattern/workspace/jiyy/pytorch-faster-rcnn/tools/../lib/model/test.py", line 20, in <module>
    from model.nms_wrapper import nms
  File "/home/pattern/workspace/jiyy/pytorch-faster-rcnn/tools/../lib/model/nms_wrapper.py", line 11, in <module>
    from nms.pth_nms import pth_nms
  File "/home/pattern/workspace/jiyy/pytorch-faster-rcnn/tools/../lib/nms/pth_nms.py", line 2, in <module>
    from ._ext import nms
  File "/home/pattern/workspace/jiyy/pytorch-faster-rcnn/tools/../lib/nms/_ext/nms/__init__.py", line 3, in <module>
    from ._nms import lib as _lib, ffi as _ffi
ImportError: /home/pattern/workspace/jiyy/pytorch-faster-rcnn/tools/../lib/nms/_ext/nms/_nms.so: undefined symbol: state

Question Regarding def _crop_pool_layer(self, bottom, rois, max_pool=True)

While computing the "theta" for the affine transformation, shouldn't the boundary of the box be:

top left corner: (x1-width)/width, (y1-height)/height
bottom right corner: (x2+1)/width, (y2+1)/height
?

That way if x1,y1=0 and x2,y2= width-1, height-1, the box maps to -1,-1, 1,1
as needed by the affine transformation functions in pytorch?
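To check the arithmetic behind this question, here is a small sketch under the convention the question describes: the left image edge maps to -1 and the right edge (pixel index width) maps to 1. Under that assumption the linear map is 2*p/size - 1, and a box covering the whole image yields the identity transform. This is not necessarily the convention used by the repository's _crop_pool_layer, and the function names are hypothetical.

```python
def to_normalized(p, size):
    """Map a pixel-edge coordinate in [0, size] linearly onto [-1, 1]."""
    return 2.0 * p / size - 1.0


def box_to_theta(x1, y1, x2, y2, width, height):
    """2x3 affine matrix that maps the square [-1, 1]^2 onto the box."""
    nx1, nx2 = to_normalized(x1, width), to_normalized(x2 + 1, width)
    ny1, ny2 = to_normalized(y1, height), to_normalized(y2 + 1, height)
    return [
        [(nx2 - nx1) / 2.0, 0.0, (nx1 + nx2) / 2.0],   # x scale, x shift
        [0.0, (ny2 - ny1) / 2.0, (ny1 + ny2) / 2.0],   # y scale, y shift
    ]


# A box spanning the full 64x48 image: x1,y1 = 0 and x2,y2 = width-1,height-1.
theta = box_to_theta(0, 0, 63, 47, width=64, height=48)
```

With this convention the full-image box maps to corners (-1, -1) and (1, 1), as the question expects.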

Why both "fg_inds.numel()=0 and bg_inds.numel()=0" not handled in proposal_target_layer.py?

In proposal_target_layer.py,

  # Small modification to the original version where we ensure a fixed number of regions are sampled
  if fg_inds.numel() > 0 and bg_inds.numel() > 0:
    ...
  elif fg_inds.numel() > 0:
    ...
  elif bg_inds.numel() > 0:
    ...
  else:
    import pdb
    pdb.set_trace()

The case when both fg_inds.numel()=0 and bg_inds.numel()=0 is not implemented.

When GT roi is not used in training rcnn (set via the cfg TRAIN.USE_GT flag), fg_inds.numel() can be zero.
Note, this is not the case for original faster rcnn where fg_inds.numel() != 0.

Also, the cfg TRAIN.BG_THRESH_HI and TRAIN.BG_THRESH_LO values can also cause bg_inds.numel()=0.

Since having both fg_inds.numel()=0 and bg_inds.numel()=0 is a possible case, I wonder why it is not handled.
In this case, why not just randomly select RoIs and set them as negatives?
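The fallback suggested in this issue could look roughly like the sketch below, written with plain Python lists instead of tensors. It is not the code from proposal_target_layer.py: the function name, the argument names and the choice to sample with replacement in order to reach a fixed count are all assumptions.

```python
import random


def sample_roi_indices(fg_inds, bg_inds, num_rois, rois_per_image,
                       fg_rois_per_image, rng=random):
    """Return (keep_indices, num_fg); the first num_fg entries are foreground."""
    if fg_inds and bg_inds:
        num_fg = min(fg_rois_per_image, len(fg_inds))
        fg = rng.sample(fg_inds, num_fg)
        bg = rng.choices(bg_inds, k=rois_per_image - num_fg)
    elif fg_inds:
        num_fg = rois_per_image
        fg = rng.choices(fg_inds, k=rois_per_image)
        bg = []
    elif bg_inds:
        num_fg = 0
        fg = []
        bg = rng.choices(bg_inds, k=rois_per_image)
    else:
        # Fallback proposed in the issue: pick random RoIs and treat them
        # as negatives. Requires num_rois > 0.
        num_fg = 0
        fg = []
        bg = rng.choices(range(num_rois), k=rois_per_image)
    return fg + bg, num_fg


rng = random.Random(0)
keep, num_fg = sample_roi_indices([], [], num_rois=10, rois_per_image=4,
                                  fg_rois_per_image=1, rng=rng)
```

Whether labelling arbitrary RoIs as background is appropriate depends on the thresholds in use, so treat this as a starting point for discussion rather than a drop-in patch.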

Error in `python': free(): invalid pointer: 0x00007f3e41406b80

Hi, thanks for the release but I got this invalid pointer issue when trying to follow the tutorial to train/ test res101 on COCO dataset. The same issue also occurs when I tried to run the demo codes.

COCO (API and images) is set up as the official website and annotations come from rbgirshick to form the "minival5k" and "minus minival(35k)" datasets.
The model (resnet101) is downloaded from the link shown in tutorial: resnet101-caffe.pth -> res101.pth.

This error appears every time I try to run the train/test scripts. It would be great if someone could help, since this issue has confused me for a long time even after repeating the tutorial several times. Thanks.

**************************** UPDATE *********************************************
After some light investigation, I've located the line (hopefully the only one...) that triggers the error. It's the import in ./tools/trainval_net.py:
from model.train_val import get_training_roidb, train_net
I also tested import model.train_val, which still caused the issue.
After some checking, I found the issue is caused by code that imports functions from utils, such as:
from utils.bbox import bbox_overlaps, in ./lib/datasets/imdb.py
I have no idea what utils is... If someone knows the answer, please let me know. This has taken me a long time. Thx.

misinterpretation of "from scratch"

Hi @ruotianluo,

I'm a little confused about the concept of "from scratch" in README. I think you still used the pre-trained models as the initial weights. So it is not suitable here with “from scratch” if referring to this paper: https://arxiv.org/abs/1708.01241.

tools/demo.py

Sorry for bothering u!
When I run demo.py, I would like to see all the detection results on one image (if I test an image, I want it to return a single image with all detected objects), but I don't know how to change the function....
thx for your reply.

Training on new dataset, the speed slows down after certain iteration and error shows

Hi, I tried to run faster-rcnn on another dataset with json annotations. It worked when I converted the json annotations to xml and followed the voc dataset format. Now I am trying to use the json directly and have written a pytorch dataset. The problem is that the speed slows down after a certain iteration and then an error appears. I haven't changed much in the training code, especially network.py, so I don't know what I should fix.

python 2.7.5 cuda 8.0 Tesla M40
vgg16
7560 roidb entries
batch size set as default

some logs info

iter: 480 / 30000, total loss: 2.036504

rpn_loss_cls: 0.663102
rpn_loss_box: 0.285543
loss_cls: 0.645373
loss_box: 0.442487
lr: 0.001000
speed: 0.322s / iter
iter: 500 / 30000, total loss: 1.373579
rpn_loss_cls: 0.327806
rpn_loss_box: 0.184931
loss_cls: 0.655622
loss_box: 0.205220
lr: 0.001000
speed: 0.321s / iter
iter: 520 / 30000, total loss: 2.680187
rpn_loss_cls: 0.390241
rpn_loss_box: 0.156651
loss_cls: 1.424312
loss_box: 0.708983
lr: 0.001000
speed: 0.320s / iter
iter: 540 / 30000, total loss: 1.936543
rpn_loss_cls: 0.328758
rpn_loss_box: 0.171708
loss_cls: 1.037029
loss_box: 0.399048
lr: 0.001000
speed: 0.320s / iter
iter: 560 / 30000, total loss: 2.059262
rpn_loss_cls: 0.350970
rpn_loss_box: 0.098141
loss_cls: 1.055237
loss_box: 0.554913
lr: 0.001000
speed: 0.357s / iter
iter: 580 / 30000, total loss: 1.768692
rpn_loss_cls: 0.511994
rpn_loss_box: 0.590019
loss_cls: 0.494195
loss_box: 0.172484
lr: 0.001000
speed: 0.592s / iter
iter: 600 / 30000, total loss: 1.049125
rpn_loss_cls: 0.265679
rpn_loss_box: 0.031345
loss_cls: 0.579659
loss_box: 0.172441
lr: 0.001000
speed: 0.810s / iter
iter: 620 / 30000, total loss: 2.416317
rpn_loss_cls: 0.413436
rpn_loss_box: 0.700102
loss_cls: 0.684415
loss_box: 0.618365
lr: 0.001000
speed: 1.015s / iter
iter: 640 / 30000, total loss: 1.650612
rpn_loss_cls: 0.249911
rpn_loss_box: 0.438877
loss_cls: 0.689089
loss_box: 0.272735
lr: 0.001000
speed: 1.207s / iter
iter: 660 / 30000, total loss: 1.902411
rpn_loss_cls: 0.415372
rpn_loss_box: 0.171251
loss_cls: 0.832487
loss_box: 0.483302
lr: 0.001000
speed: 1.390s / iter
iter: 680 / 30000, total loss: 2.508811
rpn_loss_cls: 0.565063
rpn_loss_box: 0.264611
loss_cls: 1.119584
loss_box: 0.559552
lr: 0.001000
speed: 1.559s / iter
iter: 700 / 30000, total loss: 2.196737
rpn_loss_cls: 0.406803
rpn_loss_box: 0.110491
loss_cls: 1.045586
loss_box: 0.633857
lr: 0.001000
speed: 1.722s / iter
iter: 720 / 30000, total loss: 1.117774
rpn_loss_cls: 0.428924
rpn_loss_box: 0.198056
loss_cls: 0.389690
loss_box: 0.101104
lr: 0.001000
speed: 1.875s / iter
iter: 740 / 30000, total loss: 3.418862
rpn_loss_cls: 0.320257
rpn_loss_box: 1.186544
loss_cls: 1.309661
loss_box: 0.602400
lr: 0.001000
speed: 2.020s / iter
iter: 760 / 30000, total loss: 1.773256
rpn_loss_cls: 0.458810
rpn_loss_box: 0.155227
loss_cls: 0.826238
loss_box: 0.332982
lr: 0.001000
speed: 2.161s / iter
iter: 780 / 30000, total loss: 2.134902
rpn_loss_cls: 0.472494
rpn_loss_box: 0.458725
loss_cls: 0.873805
loss_box: 0.329878
lr: 0.001000
speed: 2.294s / iter
iter: 800 / 30000, total loss: 2.456363
rpn_loss_cls: 0.692896
rpn_loss_box: 1.136842
loss_cls: 0.377992
loss_box: 0.248633
lr: 0.001000
speed: 2.419s / iter
iter: 820 / 30000, total loss: 2.277763
rpn_loss_cls: 0.515828
rpn_loss_box: 0.368687
loss_cls: 0.889839
loss_box: 0.503409
lr: 0.001000
speed: 2.538s / iter
iter: 840 / 30000, total loss: 3.272842
rpn_loss_cls: 0.464894
rpn_loss_box: 1.077220
loss_cls: 1.147062
loss_box: 0.583667
lr: 0.001000
speed: 2.652s / iter
iter: 860 / 30000, total loss: 1.022741
rpn_loss_cls: 0.322352
rpn_loss_box: 0.068675
loss_cls: 0.437379
loss_box: 0.194335
lr: 0.001000
speed: 2.762s / iter
Traceback (most recent call last):
  File "./tools/trainval_net.py", line 137, in <module>
    max_iters=args.max_iters)
  File "/data1/ymchen/project/pytorch-faster-rcnn-master/tools/../lib/model/train_val.py", line 354, in train_net
    sw.train_model(max_iters)
  File "/data1/ymchen/project/pytorch-faster-rcnn-master/tools/../lib/model/train_val.py", line 256, in train_model
    self.net.train_step_with_summary(blobs, self.optimizer)
  File "/data1/ymchen/project/pytorch-faster-rcnn-master/tools/../lib/nets/network.py", line 470, in train_step_with_summary
    self.forward(blobs['data'], blobs['im_info'], blobs['gt_boxes'])
  File "/data1/ymchen/project/pytorch-faster-rcnn-master/tools/../lib/nets/network.py", line 395, in forward
    rois, cls_prob, bbox_pred = self._predict()
  File "/data1/ymchen/project/pytorch-faster-rcnn-master/tools/../lib/nets/network.py", line 366, in _predict
    rois = self._region_proposal(net_conv)
  File "/data1/ymchen/project/pytorch-faster-rcnn-master/tools/../lib/nets/network.py", line 258, in _region_proposal
    rpn_labels = self._anchor_target_layer(rpn_cls_score)
  File "/data1/ymchen/project/pytorch-faster-rcnn-master/tools/../lib/nets/network.py", line 136, in _anchor_target_layer
    rpn_cls_score.data, self._gt_boxes.data.cpu().numpy(), self._im_info, self._feat_stride, self._anchors.data.cpu().numpy(), self._num_anchors)
AttributeError: 'NoneType' object has no attribute 'data'

TypeError: indexing a tensor with an object of type torch.cuda.LongTensor.

Dear ruotian, thank you for your great implementation and publication. However, after I finished the configuration, I got the following error:

----------------type(order.data): <class 'torch.cuda.LongTensor'>
Traceback (most recent call last):
  File "./tools/trainval_net.py", line 138, in <module>
    max_iters=args.max_iters)
  File "/home/jsptgpu/barry/pytorch-faster-rcnn/tools/../lib/model/train_val.py", line 353, in train_net
    sw.train_model(max_iters)
  File "/home/jsptgpu/barry/pytorch-faster-rcnn/tools/../lib/model/train_val.py", line 260, in train_model
    self.net.train_step_with_summary(blobs, self.optimizer)
  File "/home/jsptgpu/barry/pytorch-faster-rcnn/tools/../lib/nets/network.py", line 453, in train_step_with_summary
    self.forward(blobs['data'], blobs['im_info'], blobs['gt_boxes'])
  File "/home/jsptgpu/barry/pytorch-faster-rcnn/tools/../lib/nets/network.py", line 378, in forward
    rois, cls_prob, bbox_pred = self._predict()
  File "/home/jsptgpu/barry/pytorch-faster-rcnn/tools/../lib/nets/network.py", line 350, in _predict
    rois = self._region_proposal(net_conv)
  File "/home/jsptgpu/barry/pytorch-faster-rcnn/tools/../lib/nets/network.py", line 256, in _region_proposal
    rois, roi_scores = self._proposal_layer(rpn_cls_prob, rpn_bbox_pred) # rois, roi_scores are varible
  File "/home/jsptgpu/barry/pytorch-faster-rcnn/tools/../lib/nets/network.py", line 84, in _proposal_layer
    self._feat_stride, self._anchors, self._num_anchors)
  File "/home/jsptgpu/barry/pytorch-faster-rcnn/tools/../lib/layer_utils/proposal_layer.py", line 43, in proposal_layer
    proposals = proposals[order.data, :]
  File "/home/jsptgpu/anaconda2/lib/python2.7/site-packages/torch/autograd/variable.py", line 69, in __getitem__
    return Index(key)(self)
  File "/home/jsptgpu/anaconda2/lib/python2.7/site-packages/torch/autograd/_functions/tensor.py", line 16, in forward
    result = i.index(self.index)
TypeError: indexing a tensor with an object of type torch.cuda.LongTensor. The only supported types are integers, slices, numpy scalars and torch.cuda.LongTensor or torch.cuda.ByteTensor as the only argument.
Command exited with non-zero status 1
423.86user 6.79system 7:11.02elapsed 99%CPU (0avgtext+0avgdata 2068360maxresident)k
0inputs+892960outputs (0major+2095481minor)pagefaults 0swaps

Any ideas what could cause the problem?

Runtime error when using crop pooling

Hi @ruotianluo, I hit the following runtime error when using crop pooling on the CPU with the vgg16 net.

I hope there is a solution. Thank you.

CUDA 9.0

Hi,
Is it possible to use this project with CUDA 9.0?
I have just tried, but I get this error when I run:
./experiments/scripts/train_faster_rcnn.sh 0 pascal_voc vgg16

Maybe it's a problem with the CUDA version:

RuntimeError: cuda runtime error (48) : no kernel image is available for execution on the device at /pytorch/torch/lib/THC/generic/THCTensorIndex.cu:648
Command exited with non-zero status 1
25.60user 2.80system 0:33.87elapsed 83%CPU (0avgtext+0avgdata 1805900maxresident)k
2214176inputs+16outputs (2754major+311010minor)pagefaults 0swaps

Thanks in advance

Memory usage when train a model

Hi,
Do you know how to calculate the maximum GPU memory usage when training a model, so that I can change the SCALE size of the input image?
In fact, when I trained the model with the original TensorFlow version, the SCALE of the input image could be larger than with this PyTorch version. (When training with a large SCALE (about 800 * 1000), the PyTorch version runs out of memory.)

Thanks very much!

More than batch size 1

@brisker it's better to open a new issue.

In principle, roi pooling and crop_and_resize are easy to extend to a larger batch_size. However, I don't know whether other parts are built assuming a batch_size of one. (The original tf-faster-rcnn removes the batch_size option entirely and assumes batch_size 1.)

Memory leak

Hi!

So when I'm running demo.py on some images, it seems that the net object takes increasing amounts of space.

Specifically, asizeof(net) (from pympler.sizeof) shows that the net object takes just over 1GB after loading the state dict,
about 5GB after detection on the first picture, and about 6.7GB after the third. After that it stops changing in size for 5-6 pictures, then takes 7.7GB, then goes back to 6.7GB, and then there is an "out of memory" error. A really weird pattern!

I'm trying the default resnet 101 with all the defaults from demo.py.
My GPU is Nvidia GTX1080.

How can I fix this issue?

is "res101_faster_rcnn_iter_110000.pth" provided?

I am trying to run the evaluation code: ./experiments/scripts/test_faster_rcnn.sh $GPU_ID pascal_voc_0712 res101.

It seems that ./data/scripts/fetch_faster_rcnn_models.sh gives me the tensorflow model "res101_faster_rcnn_iter_110000.ckpt.data-00000-of-00001/index/meta/ ... etc"

Where can I find the PyTorch model?

train on custom dataset

Hi,
Is there a convenient way to train or fine-tune the network on a custom dataset?

difference in your res101.pth and pytorch's provided https://s3.amazonaws.com/pytorch/models/resnet101-5d3b4d8f.pth

What's the difference between the imagenet resnet 101 model you provide (https://github.com/ruotianluo/pytorch-resnet) and that provided by pytorch at https://s3.amazonaws.com/pytorch/models/resnet101-5d3b4d8f.pth? You note that the two models are different. Can you give a brief summary how?

about the data order of roi_pooling (cpu version)

I have read the code in "lib/layer_utils/roi_pooling/src/roi_pooling.c", which is the CPU version of roi_pooling. I found that the data order follows the NHWC format when indexing the bottom data, but PyTorch follows the NCHW format. Is it my mistake or something else?
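To make the layout question concrete, here is a minimal sketch (not taken from the repository) that contrasts the flat offset of element (n, c, h, w) under the two layouts. The function names are hypothetical; the formulas are the standard row-major ones for a contiguous buffer.

```python
def nchw_offset(n, c, h, w, C, H, W):
    """Flat offset in a contiguous buffer laid out as [N, C, H, W]."""
    return ((n * C + c) * H + h) * W + w


def nhwc_offset(n, c, h, w, C, H, W):
    """Flat offset in a contiguous buffer laid out as [N, H, W, C]."""
    return ((n * H + h) * W + w) * C + c


if __name__ == "__main__":
    dims = dict(C=4, H=5, W=6)
    # Same logical element, two different positions in memory.
    print(nchw_offset(0, 1, 2, 3, **dims))
    print(nhwc_offset(0, 1, 2, 3, **dims))
```

In NCHW, moving to the next channel jumps H * W elements, while in NHWC it jumps one element. C code that indexes with the second formula therefore expects its input to have been permuted to NHWC first.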

Crash while training on COCO 2014

While training on COCO 2014 dataset, I'm running into the following crash:

Both bg_inds.numel() and fg_inds.numel() are 0, which leads to a crash in keep_inds = torch.cat([fg_inds, bg_inds], 0)

I'm using pytorch on windows, running the program using the following parameters:

--imdb coco_2014_train+coco_2014_val --imdbval coco_2014_val --weight C:\path-to\pytorch-faster-rcnn\data\imagenet_weights\res50.pth

As you do in your training shell scripts, I added an extra element to the anchors:

cfg.ANCHOR_SCALES = [4, 8, 16, 32]

Any ideas? It sounds like the function should exit early if no background or foreground regions are found.
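The early exit suggested above could look roughly like the following. This is a plain-Python stand-in that uses lists instead of tensors, not the repository's code: the function name, its arguments, and the choice to return None are all assumptions made for illustration.

```python
import random


def choose_keep_indices(fg_inds, bg_inds, rois_per_image, fg_rois_per_image, rng=random):
    """Pick RoI indices to keep, or return None when there is nothing to sample."""
    if fg_inds and bg_inds:
        # Usual case: cap the foreground count, fill the rest with background.
        n_fg = min(fg_rois_per_image, len(fg_inds))
        fg = rng.sample(fg_inds, n_fg)
        n_bg = rois_per_image - n_fg
        if len(bg_inds) >= n_bg:
            bg = rng.sample(bg_inds, n_bg)
        else:
            # Too few background RoIs: sample with replacement.
            bg = [rng.choice(bg_inds) for _ in range(n_bg)]
    elif fg_inds:
        fg = [rng.choice(fg_inds) for _ in range(rois_per_image)]
        bg = []
    elif bg_inds:
        fg = []
        bg = [rng.choice(bg_inds) for _ in range(rois_per_image)]
    else:
        # Neither foreground nor background: signal the caller to skip this image.
        return None
    return fg + bg


if __name__ == "__main__":
    print(choose_keep_indices([], [], 8, 2))
    print(len(choose_keep_indices([0, 1], [2, 3, 4], 8, 2)))
```

The caller would have to handle the None case explicitly, for example by skipping the image, which is a design choice rather than something the original post confirms.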

error when building NMS

I am using GTX 1080 Ti (sm_61), Python 3.6, ubuntu 16.04

nvcc -c -o nms_kernel.cu.o nms_kernel.cu -x cu -Xcompiler -fPIC -arch=sm_61
cd ../../
python build.py

I keep getting the following messages. Any suggestion on how to fix the issues?

building 'nms' extension
creating home
creating home/jianxuc
creating home/jianxuc/Projects
creating home/jianxuc/Projects/my_project
creating home/jianxuc/Projects/my_project/pytorch-faster-rcnn
creating home/jianxuc/Projects/my_project/pytorch-faster-rcnn/lib
creating home/jianxuc/Projects/my_project/pytorch-faster-rcnn/lib/nms
creating home/jianxuc/Projects/my_project/pytorch-faster-rcnn/lib/nms/src
gcc -pthread -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DWITH_CUDA -I/home/jianxuc/anaconda3/lib/python3.6/site-packages/torch/utils/ffi/../../lib/include -I/home/jianxuc/anaconda3/lib/python3.6/site-packages/torch/utils/ffi/../../lib/include/TH -I/home/jianxuc/anaconda3/lib/python3.6/site-packages/torch/utils/ffi/../../lib/include/THC -I/usr/local/cuda/include -I/home/jianxuc/anaconda3/include/python3.6m -c _nms.c -o ./_nms.o
gcc -pthread -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DWITH_CUDA -I/home/jianxuc/anaconda3/lib/python3.6/site-packages/torch/utils/ffi/../../lib/include -I/home/jianxuc/anaconda3/lib/python3.6/site-packages/torch/utils/ffi/../../lib/include/TH -I/home/jianxuc/anaconda3/lib/python3.6/site-packages/torch/utils/ffi/../../lib/include/THC -I/usr/local/cuda/include -I/home/jianxuc/anaconda3/include/python3.6m -c /home/jianxuc/Projects/my_project/pytorch-faster-rcnn/lib/nms/src/nms.c -o ./home/jianxuc/Projects/my_project/pytorch-faster-rcnn/lib/nms/src/nms.o
In file included from /home/jianxuc/anaconda3/lib/python3.6/site-packages/torch/utils/ffi/../../lib/include/TH/TH.h:4:0,
from /home/jianxuc/Projects/my_project/pytorch-faster-rcnn/lib/nms/src/nms.c:1:
/home/jianxuc/Projects/my_project/pytorch-faster-rcnn/lib/nms/src/nms.c: In function ‘cpu_nms’:
/home/jianxuc/Projects/my_project/pytorch-faster-rcnn/lib/nms/src/nms.c:7:42: warning: passing argument 1 of ‘THLongTensor_isContiguous’ from incompatible pointer type [-Wincompatible-pointer-types]
THArgCheck(THLongTensor_isContiguous(boxes), 2, "boxes must be contiguous");
^
/home/jianxuc/anaconda3/lib/python3.6/site-packages/torch/utils/ffi/../../lib/include/TH/THGeneral.h:78:35: note: in definition of macro ‘THArgCheck’
_THArgCheck(__FILE__, __LINE__, __VA_ARGS__);
^
/home/jianxuc/anaconda3/lib/python3.6/site-packages/torch/utils/ffi/../../lib/include/TH/THTensor.h:8:39: note: expected ‘const THLongTensor * {aka const struct THLongTensor *}’ but argument is of type ‘THFloatTensor * {aka struct THFloatTensor *}’
#define THTensor_(NAME) TH_CONCAT_4(TH,Real,Tensor_,NAME)
^
/home/jianxuc/anaconda3/lib/python3.6/site-packages/torch/utils/ffi/../../lib/include/TH/THGeneral.h:116:37: note: in definition of macro ‘TH_CONCAT_4_EXPAND’
#define TH_CONCAT_4_EXPAND(x,y,z,w) x ## y ## z ## w
^
/home/jianxuc/anaconda3/lib/python3.6/site-packages/torch/utils/ffi/../../lib/include/TH/THTensor.h:8:27: note: in expansion of macro ‘TH_CONCAT_4’
#define THTensor_(NAME) TH_CONCAT_4(TH,Real,Tensor_,NAME)
^
/home/jianxuc/anaconda3/lib/python3.6/site-packages/torch/utils/ffi/../../lib/include/TH/generic/THTensor.h:115:12: note: in expansion of macro ‘THTensor_’
TH_API int THTensor_(isContiguous)(const THTensor *self);
^
/home/jianxuc/Projects/my_project/pytorch-faster-rcnn/lib/nms/src/nms.c:9:42: warning: passing argument 1 of ‘THLongTensor_isContiguous’ from incompatible pointer type [-Wincompatible-pointer-types]
THArgCheck(THLongTensor_isContiguous(areas), 4, "areas must be contiguous");
^
/home/jianxuc/anaconda3/lib/python3.6/site-packages/torch/utils/ffi/../../lib/include/TH/THGeneral.h:78:35: note: in definition of macro ‘THArgCheck’
_THArgCheck(__FILE__, __LINE__, __VA_ARGS__);
^
/home/jianxuc/anaconda3/lib/python3.6/site-packages/torch/utils/ffi/../../lib/include/TH/THTensor.h:8:39: note: expected ‘const THLongTensor * {aka const struct THLongTensor *}’ but argument is of type ‘THFloatTensor * {aka struct THFloatTensor *}’
#define THTensor_(NAME) TH_CONCAT_4(TH,Real,Tensor_,NAME)
^
/home/jianxuc/anaconda3/lib/python3.6/site-packages/torch/utils/ffi/../../lib/include/TH/THGeneral.h:116:37: note: in definition of macro ‘TH_CONCAT_4_EXPAND’
#define TH_CONCAT_4_EXPAND(x,y,z,w) x ## y ## z ## w
^
/home/jianxuc/anaconda3/lib/python3.6/site-packages/torch/utils/ffi/../../lib/include/TH/THTensor.h:8:27: note: in expansion of macro ‘TH_CONCAT_4’
#define THTensor_(NAME) TH_CONCAT_4(TH,Real,Tensor_,NAME)
^
/home/jianxuc/anaconda3/lib/python3.6/site-packages/torch/utils/ffi/../../lib/include/TH/generic/THTensor.h:115:12: note: in expansion of macro ‘THTensor_’
TH_API int THTensor_(isContiguous)(const THTensor *self);
^
gcc -pthread -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DWITH_CUDA -I/home/jianxuc/anaconda3/lib/python3.6/site-packages/torch/utils/ffi/../../lib/include -I/home/jianxuc/anaconda3/lib/python3.6/site-packages/torch/utils/ffi/../../lib/include/TH -I/home/jianxuc/anaconda3/lib/python3.6/site-packages/torch/utils/ffi/../../lib/include/THC -I/usr/local/cuda/include -I/home/jianxuc/anaconda3/include/python3.6m -c /home/jianxuc/Projects/my_project/pytorch-faster-rcnn/lib/nms/src/nms_cuda.c -o ./home/jianxuc/Projects/my_project/pytorch-faster-rcnn/lib/nms/src/nms_cuda.o
/home/jianxuc/Projects/my_project/pytorch-faster-rcnn/lib/nms/src/nms_cuda.c: In function ‘gpu_nms’:
/home/jianxuc/Projects/my_project/pytorch-faster-rcnn/lib/nms/src/nms_cuda.c:29:35: warning: initialization from incompatible pointer type [-Wincompatible-pointer-types]
unsigned long long * mask_flat = THCudaLongTensor_data(state, mask);
^
/home/jianxuc/Projects/my_project/pytorch-faster-rcnn/lib/nms/src/nms_cuda.c:37:40: warning: initialization from incompatible pointer type [-Wincompatible-pointer-types]
unsigned long long * mask_cpu_flat = THLongTensor_data(mask_cpu);
^
/home/jianxuc/Projects/my_project/pytorch-faster-rcnn/lib/nms/src/nms_cuda.c:40:39: warning: initialization from incompatible pointer type [-Wincompatible-pointer-types]
unsigned long long * remv_cpu_flat = THLongTensor_data(remv_cpu);
^
/home/jianxuc/Projects/my_project/pytorch-faster-rcnn/lib/nms/src/nms_cuda.c:23:7: warning: unused variable ‘boxes_dim’ [-Wunused-variable]
int boxes_dim = THCudaTensor_size(state, boxes, 1);
^
gcc -pthread -shared -L/home/jianxuc/anaconda3/lib -Wl,-rpath=/home/jianxuc/anaconda3/lib,--no-as-needed ./_nms.o ./home/jianxuc/Projects/my_project/pytorch-faster-rcnn/lib/nms/src/nms.o ./home/jianxuc/Projects/my_project/pytorch-faster-rcnn/lib/nms/src/nms_cuda.o /home/jianxuc/Projects/my_project/pytorch-faster-rcnn/lib/nms/src/cuda/nms_kernel.cu.o -L/home/jianxuc/anaconda3/lib -lpython3.6m -o ./_nms.so

Windows Build

Has anyone tried this on windows? I'm in the process of trying to get it to work, but wanted to see if anyone has already tried and has any suggestions.

error in resnet google drive file

Hi
I downloaded the resnet101 weights from Google Drive (res101_faster_rcnn_iter_1190000.pth).
When I run demo.py I get an IOError:
res101_faster_rcnn_iter_110000.pth not found
How can I download this file?
When I change 1190000.pth to 110000.pth I get an error about tensor sizes.
Thanks a lot.

ImportError: No module named google.protobuf Command exited with non-zero status 1

I am facing the following problem while training the model.

ImportError: No module named google.protobuf
Command exited with non-zero status 1
When I try to install protobuf I get:
pip install protobuf
Requirement already satisfied: protobuf in /usr/local/lib/python2.7/dist-packages
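One possibility, which the post does not confirm, is that pip installed protobuf for a different Python interpreter than the one that runs training. A minimal sketch for checking this from inside the interpreter in question (the helper name is hypothetical; it works on both Python 2.7 and Python 3):

```python
import importlib
import sys


def module_available(name):
    """Return True if `name` can be imported by the interpreter running this code."""
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    # Compare this path with the site-packages path that pip reported.
    print("interpreter: %s" % sys.executable)
    print("google.protobuf importable: %s" % module_available("google.protobuf"))
```

Running the script with the same command that launches training shows whether the two environments match.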

bbox_inside_weights/outside_weights

What is the significance of the bbox_inside and bbox_outside weights in smooth_l1_loss? I followed the Fast and Faster R-CNN papers closely and do not see them defined anywhere.
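For reference, a common formulation of the smooth L1 loss with these two weights, as used in Faster R-CNN implementations derived from the original Caffe code, can be sketched in plain Python as follows. This is an illustration under that assumption, not a copy of this repository's function, and the flat-list signature is invented for the example.

```python
def smooth_l1_loss(pred, target, inside_w, outside_w, sigma=1.0):
    """Weighted smooth L1 loss over flat lists of box regression values."""
    sigma2 = sigma ** 2
    total = 0.0
    for p, t, w_in, w_out in zip(pred, target, inside_w, outside_w):
        # Inside weight masks the difference, e.g. 0 for background RoIs.
        diff = abs(w_in * (p - t))
        if diff < 1.0 / sigma2:
            per_elem = 0.5 * sigma2 * diff ** 2  # quadratic near zero
        else:
            per_elem = diff - 0.5 / sigma2       # linear for large errors
        # Outside weight rescales the per-element loss (normalization).
        total += w_out * per_elem
    return total


if __name__ == "__main__":
    print(smooth_l1_loss([0.5, 2.0], [0.0, 0.0], [1.0, 1.0], [1.0, 1.0]))
```

In this reading, the inside weights select which coordinates contribute at all, and the outside weights control how much each selected coordinate counts in the final sum.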

Error when running demo.py: No module named '_ext'

Can you explain what this import is doing in path_nms.py?
from _ext import nms
It's preventing me from running the demo script. I'm using Python 3.6.

_roi_pooling.so: undefined symbol: _Py_Dealloc

Hi,
I very much appreciate that you made this repository publicly available. Before getting my hands dirtier, I tried to run ./tools/demo.py; however, I got the following error:

GPU_ID=0
CUDA_VISIBLE_DEVICES=${GPU_ID} ./tools/demo.py
Traceback (most recent call last):
  File "./tools/demo.py", line 29, in <module>
    from nets.vgg16 import vgg16
  File "/home/sam/projects/faster.rcnn.pytorch/tools/../lib/nets/vgg16.py", line 10, in <module>
    from nets.network import Network
  File "/home/sam/projects/faster.rcnn.pytorch/tools/../lib/nets/network.py", line 28, in <module>
    from layer_utils.roi_pooling.roi_pool import RoIPoolFunction
  File "/home/sam/projects/faster.rcnn.pytorch/tools/../lib/layer_utils/roi_pooling/roi_pool.py", line 3, in <module>
    from _ext import roi_pooling
  File "/home/sam/projects/faster.rcnn.pytorch/tools/../lib/layer_utils/roi_pooling/_ext/roi_pooling/__init__.py", line 3, in <module>
    from ._roi_pooling import lib as _lib, ffi as _ffi
ImportError: /home/sam/projects/faster.rcnn.pytorch/tools/../lib/layer_utils/roi_pooling/_ext/roi_pooling/_roi_pooling.so: undefined symbol: _Py_Dealloc

Any ideas what could cause the problem?

running predictor for specific boxes

Hello,

Is it possible to run the fRCNN on specific boxes (e.g. hand-drawn) without using the RPN? If so, do you have any pointers on where to start?

thanks in advance!

Stride for layer4 in resnet

@philokey, I will create a new branch for this so that the current pretrained model won't be messed up.

This issue was originally opened in ruotianluo/pytorch-resnet#2

Command terminated by signal 11

When I run the test experiment using the command

GPU_ID=0
./experiments/scripts/test_faster_rcnn.sh $GPU_ID pascal_voc_0712 res101

I get the error "Command terminated by signal 11".
Moreover, when I run the demo in the "/tools" folder, a segfault (core dump) occurs, although the expected object detection results are produced for the tested images.

Run into memory error during training

Hi, I am currently training on my own dataset using your framework. I was able to train for 110000 iterations (following the VOC07+12 protocol). However, when I increased the number of iterations to 490000, training stopped at iteration 180000 with the following error message. Do you have any idea what the reason could be? Thank you very much!

Traceback (most recent call last):
File "./tools/trainval_net.py", line 138, in <module>
max_iters=args.max_iters)
File "/home/cwang/project/pytorch-faster-rcnn/tools/../lib/model/train_val.py", line 348, in train_net
sw.train_model(max_iters)
File "/home/cwang/project/pytorch-faster-rcnn/tools/../lib/model/train_val.py", line 255, in train_model
self.net.train_step_with_summary(blobs, self.optimizer)
File "/home/cwang/project/pytorch-faster-rcnn/tools/../lib/nets/network.py", line 478, in train_step_with_summary
summary = self._run_summary_op()
File "/home/cwang/project/pytorch-faster-rcnn/tools/../lib/nets/network.py", line 347, in _run_summary_op
summaries += self._add_act_summary(key, var)
File "/home/cwang/project/pytorch-faster-rcnn/tools/../lib/nets/network.py", line 66, in _add_act_summary
return tb.summary.histogram('ACT/' + key + '/activations', tensor.data.cpu().numpy(), bins='auto'),
File "/home/cwang/anaconda3/envs/tf11_py27/lib/python2.7/site-packages/tensorboardX/summary.py", line 113, in histogram
hist = make_histogram(values.astype(float), bins)
File "/home/cwang/anaconda3/envs/tf11_py27/lib/python2.7/site-packages/tensorboardX/summary.py", line 131, in make_histogram
bucket=counts)
File "/home/cwang/anaconda3/envs/tf11_py27/lib/python2.7/site-packages/google/protobuf/internal/python_message.py", line 508, in __init__
copy.extend(field_value)
File "/home/cwang/anaconda3/envs/tf11_py27/lib/python2.7/site-packages/google/protobuf/internal/containers.py", line 275, in extend
new_values = [self._type_checker.CheckValue(elem) for elem in elem_seq_iter]
MemoryError
Command exited with non-zero status 1

dlopen: cannot load any more object with static TLS

When I followed step 2.Train in Train your own model, I used command ./experiments/scripts/train_faster_rcnn.sh 0 pascal_voc res50 and an error occurred:
File "/usr/local/lib/python2.7/dist-packages/torch/__init__.py", line 54, in <module>
from torch._C import *
ImportError: dlopen: cannot load any more object with static TLS
Command exited with non-zero status 1

Does anybody know how to solve this?

Poor results

Hi,
I am using a TITAN X, Ubuntu 16.04, CUDA 8, and cuDNN 5, and I trained the model as described in the README. However, I cannot reproduce the results mentioned in the README. I have two problems.

At the beginning of training, the loss sometimes becomes nan. It seems something is wrong in the initialization, but I cannot find the bug in the code.

iter: 20 / 110000, total loss: nan
 >>> rpn_loss_cls: 0.691554
 >>> rpn_loss_box: 0.019584
 >>> loss_cls: nan
 >>> loss_box: nan
 >>> lr: 0.001000
speed: 0.382s / iter

When the losses are not nan, I get poor results: Mean AP = 0.6542 or Mean AP = 0.5809. I train on VOC 2007+2012 trainval and test on VOC 2007 using the default config.

Can you help me?

undefined symbol: __cudaUnregisterFatBinary

I get the following error when I start training:
pytorch-faster-rcnn/tools/../lib/nms/_ext/nms/_nms.so: undefined symbol: __cudaUnregisterFatBinary
Is something wrong with my CUDA environment?

Could this repo be applied to train own data?

Can this repo be used to train on my own data, other than COCO and VOC? Is it OK that my dataset has only two classes? More guidance would be appreciated.

Zero RPN bias

Hi @ruotianluo, I trained a new model on the PASCAL VOC dataset with VGG16 (CPU only with roi_pooling, Python 3.6.3), using weights downloaded from https://github.com/jcjohnson/pytorch-vgg. However, I found that the biases of rpn_net, rpn_cls_score_net, rpn_bbox_pred_net, cls_score_net, and bbox_pred_net are always zero. Is this an error? If so, which part should be modified?

Evaluation Error with Annotation Path

I'm assuming this is just another Python3-induced error, but trying to debug and fix is challenging with such a big project, so any ideas would help a lot.

I'm not sure exactly how it's intended to work, but it seems as though for some reason it saved an empty pickle file called test.txt_annots.pkl in the folder /home/leow/pytorch-faster-rcnn/data/VOCdevkit2007/VOC2007/ImageSets/Main/ (I printed this path out in the output below), and when it tries to load this it errors.

Any ideas where this could have gone wrong?

`Evaluating detections
Writing aeroplane VOC results file
Writing bicycle VOC results file
Writing bird VOC results file
Writing boat VOC results file
Writing bottle VOC results file
Writing bus VOC results file
Writing car VOC results file
Writing cat VOC results file
Writing chair VOC results file
Writing cow VOC results file
Writing diningtable VOC results file
Writing dog VOC results file
Writing horse VOC results file
Writing motorbike VOC results file
Writing person VOC results file
Writing pottedplant VOC results file
Writing sheep VOC results file
Writing sofa VOC results file
Writing train VOC results file
Writing tvmonitor VOC results file
VOC07 metric? Yes
/home/leow/pytorch-faster-rcnn/data/VOCdevkit2007/VOC2007/ImageSets/Main/test.txt_annots.pkl
Traceback (most recent call last):
File "/home/leow/pytorch-faster-rcnn/tools/../lib/datasets/voc_eval.py", line 128, in voc_eval
recs = pickle.load(f)
EOFError: Ran out of input

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "./tools/test_net.py", line 117, in <module>
test_net(net, imdb, filename, max_per_image=args.max_per_image)
File "/home/leow/pytorch-faster-rcnn/tools/../lib/model/test.py", line 194, in test_net
imdb.evaluate_detections(all_boxes, output_dir)
File "/home/leow/pytorch-faster-rcnn/tools/../lib/datasets/pascal_voc.py", line 280, in evaluate_detections
self._do_python_eval(output_dir)
File "/home/leow/pytorch-faster-rcnn/tools/../lib/datasets/pascal_voc.py", line 243, in _do_python_eval
use_07_metric=use_07_metric)
File "/home/leow/pytorch-faster-rcnn/tools/../lib/datasets/voc_eval.py", line 130, in voc_eval
recs = pickle.load(f, encoding='bytes')
EOFError: Ran out of input
`
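The EOFError above is what pickle raises when it reads an empty or truncated file. A minimal sketch of a defensive loader follows; the helper name is hypothetical, and deleting the bad cache so that it gets regenerated is an assumption about the desired behaviour, not something the repository is known to do.

```python
import os
import pickle
import tempfile


def load_cache_or_none(path):
    """Load a pickle cache; delete it and return None if it is empty or truncated."""
    if not os.path.isfile(path):
        return None
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except EOFError:
        # Empty or truncated file, e.g. left behind by an interrupted earlier run.
        os.remove(path)
        return None


if __name__ == "__main__":
    cache_dir = tempfile.mkdtemp()
    cache_path = os.path.join(cache_dir, "test.txt_annots.pkl")
    open(cache_path, "wb").close()  # simulate the empty cache file
    print(load_cache_or_none(cache_path))
    print(os.path.exists(cache_path))
```

With a loader like this, the evaluation code could fall back to rebuilding the annotation cache whenever None is returned.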

Update your -arch in setup script to match your GPU？

Is this line in README outdated? I don't find any setup.py in lib.
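Wherever the flag ends up being set, its value follows from the GPU's compute capability. A tiny sketch of that mapping (the helper name is hypothetical): a GTX 1080 Ti has compute capability 6.1 and a Tesla P100 has 6.0, which matches the sm_61 and sm_60 values reported in other issues on this page.

```python
def nvcc_arch_flag(major, minor):
    """Build the nvcc -arch flag for a GPU with compute capability major.minor."""
    return "-arch=sm_%d%d" % (major, minor)


if __name__ == "__main__":
    print(nvcc_arch_flag(6, 1))  # GTX 1080 Ti
    print(nvcc_arch_flag(6, 0))  # Tesla P100
```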

An error happened when running build.py of roi pooling

Excuse me, if anyone has run into this problem, could you please tell me how to solve it?

x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -fPIC -DWITH_CUDA -I/usr/local/lib/python2.7/dist-packages/torch/utils/ffi/../../lib/include -I/usr/local/lib/python2.7/dist-packages/torch/utils/ffi/../../lib/include/TH -I/usr/local/lib/python2.7/dist-packages/torch/utils/ffi/../../lib/include/THC -I/usr/include/python2.7 -c _roi_pooling.c -o ./_roi_pooling.o
In file included from /usr/local/lib/python2.7/dist-packages/torch/utils/ffi/../../lib/include/THC/THC.h:4:0,
from _roi_pooling.c:483:
/usr/local/lib/python2.7/dist-packages/torch/utils/ffi/../../lib/include/THC/THCGeneral.h:9:18: fatal error: cuda.h: No such file or directory
compilation terminated.
Traceback (most recent call last):
File "build.py", line 34, in <module>
ffi.build()
File "/usr/local/lib/python2.7/dist-packages/torch/utils/ffi/__init__.py", line 164, in build
_build_extension(ffi, cffi_wrapper_name, target_dir, verbose)
File "/usr/local/lib/python2.7/dist-packages/torch/utils/ffi/__init__.py", line 100, in _build_extension
ffi.compile(tmpdir=tmpdir, verbose=verbose, target=libname)
File "/usr/local/lib/python2.7/dist-packages/cffi/api.py", line 690, in compile
compiler_verbose=verbose, debug=debug, **kwds)
File "/usr/local/lib/python2.7/dist-packages/cffi/recompiler.py", line 1513, in recompile
compiler_verbose, debug)
File "/usr/local/lib/python2.7/dist-packages/cffi/ffiplatform.py", line 22, in compile
outputfilename = _build(tmpdir, ext, compiler_verbose, debug)
File "/usr/local/lib/python2.7/dist-packages/cffi/ffiplatform.py", line 58, in _build
raise VerificationError('%s: %s' % (e.__class__.__name__, e))
cffi.error.VerificationError: CompileError: command 'x86_64-linux-gnu-gcc' failed with exit status 1

KeyError: 'boxes' in training my own model

I followed instruction of Train your own model and got vgg16.pth. When training using ./experiments/scripts/train_faster_rcnn.sh 0 pascal_voc vgg16, error occurred:

Loaded dataset `voc_2007_trainval` for training
Set proposal method: gt
Appending horizontally-flipped training examples...
voc_2007_trainval gt roidb loaded from /home/ljy/pytorch-examples-master/pytorch-faster-rcnn-master/data/cache/voc_2007_trainval_gt_roidb.pkl
Traceback (most recent call last):
  File "./tools/trainval_net.py", line 104, in <module>
    imdb, roidb = combined_roidb(args.imdb_name)
  File "./tools/trainval_net.py", line 75, in combined_roidb
    roidbs = [get_roidb(s) for s in imdb_names.split('+')]
  File "./tools/trainval_net.py", line 75, in <listcomp>
    roidbs = [get_roidb(s) for s in imdb_names.split('+')]
  File "./tools/trainval_net.py", line 72, in get_roidb
    roidb = get_training_roidb(imdb)
  File "/home/ljy/pytorch-examples-master/pytorch-faster-rcnn-master/tools/../lib/model/train_val.py", line 302, in get_training_roidb
    imdb.append_flipped_images()
  File "/home/ljy/pytorch-examples-master/pytorch-faster-rcnn-master/tools/../lib/datasets/imdb.py", line 113, in append_flipped_images
    boxes = self.roidb[i]['boxes'].copy()
KeyError: 'boxes'
Command exited with non-zero status 1
7.08user 0.19system 0:07.29elapsed 99%CPU (0avgtext+0avgdata 331752maxresident)k
0inputs+40outputs (0major+99561minor)pagefaults 0swaps

BTW, I am confused about the modification to vgg16. Why do we need the following?

import torch
from torch.utils.model_zoo import load_url
from torchvision import models
sd = load_url("https://s3-us-west-2.amazonaws.com/jcjohns-models/vgg16-00b39a1b.pth")
sd['classifier.0.weight'] = sd['classifier.1.weight']
sd['classifier.0.bias'] = sd['classifier.1.bias']
del sd['classifier.1.weight']
del sd['classifier.1.bias']
sd['classifier.3.weight'] = sd['classifier.4.weight']
sd['classifier.3.bias'] = sd['classifier.4.bias']
del sd['classifier.4.weight']
del sd['classifier.4.bias']
torch.save(sd, "vgg16.pth")

Test Without GPU

Hi, I downloaded the code and ran it on an NVIDIA Tesla P100. I set the -arch flag to sm_60 and successfully trained and tested the model with GPUs on the COCO 2014 dataset. I wrote a blog post explaining the steps in detail. I was wondering whether it is possible to test without a GPU.

happened to a error when run build.py

Can anyone help me with the above error when I run build.py? Thank you so much.

how can I build a fit net architecture for coco_900k-1190k model's parameter?

When I try to load the parameters into a net built with net.create_architecture(21, tag='default', anchor_scales=[8, 16, 32]),
I get errors like: RuntimeError: inconsistent tensor size, expected tensor [18 x 512 x 1 x 1] and src [24 x 512 x 1 x 1] to have the same number of elements, but got 9216 and 12288 elements respectively at /pytorch/torch/lib/TH/generic/THTensorCopy.c:86
(However, when I load parameters from the pascal voc model, there are no errors.)
So I guess I created the wrong architecture, but I don't know how to fix it.
Thanks for your reply.
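The shapes in the error are consistent with a different anchor count: 18 = 2 x 9 and 24 = 2 x 12. The arithmetic can be sketched as below, assuming three aspect ratios (the usual Faster R-CNN default, which is an assumption here) and a hypothetical helper name. Under that assumption the checkpoint would correspond to four anchor scales, such as the [4, 8, 16, 32] mentioned in the COCO issue above; the class count passed to create_architecture may also need to match the checkpoint.

```python
def rpn_output_channels(anchor_scales, anchor_ratios=(0.5, 1, 2)):
    """Channel counts of the RPN heads for a given anchor configuration."""
    num_anchors = len(anchor_scales) * len(anchor_ratios)
    return {
        "num_anchors": num_anchors,
        "rpn_cls_score": 2 * num_anchors,  # object / not-object per anchor
        "rpn_bbox_pred": 4 * num_anchors,  # 4 box deltas per anchor
    }


if __name__ == "__main__":
    print(rpn_output_channels([8, 16, 32]))
    print(rpn_output_channels([4, 8, 16, 32]))
```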

Error occur during training , Command terminated by signal 15

This occurs when I run ./experiments/scripts/train_faster_rcnn.sh 1 coco res101
Environment: TITAN X (Pascal), Ubuntu 16.04.1