tum-vision / fusenet

This repository is the official release of the code for the paper "FuseNet: Incorporating Depth into Semantic Segmentation via Fusion-based CNN Architecture", published at the 13th Asian Conference on Computer Vision (ACCV 2016).


fusenet's Introduction

FuseNet

Please refer to the PyTorch implementation for an up-to-date version.

FuseNet is a general deep convolutional neural network (CNN) architecture for training on datasets with RGB-D images. It can be used for semantic segmentation, scene classification and other applications. This repository is the official release of the code for our paper, implemented on top of the BVLC/caffe framework.

Usage

Installation

The code is compatible with an early Caffe version from June 2016. It was developed under Ubuntu 16.04 with CUDA 7.5 and cuDNN v5.0. If you use the program under other Ubuntu distributions, you may need to comment out lines 72--73 in the root CMakeLists.txt file. If you compile under another OS, please use Google as your friend. We mostly tested the program with an Nvidia Titan X GPU. Please note that multi-GPU training is supported.

git clone https://github.com/tum-vision/fusenet.git
mkdir build && cd build
cmake ..
make -j10
make runtest -j10

Training and Testing

We provide all needed python scripts and prototxt files to reproduce our published results under ./fusenet/. A short guideline is given below. For further detailed instructions, check here.

Initialization

Our network architecture is based on the 16-layer VGGNet. However, since we have an extra input channel for depth, we provide a script to compute the initial weights of the depth encoder from the pretrained VGG model; the first-layer depth filters can be obtained by averaging the pretrained filters over the three RGB channels.
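Since the script itself is not shown in this README, here is a minimal sketch of this kind of initialization with the Caffe Python interface; the prototxt/caffemodel paths and the depth-branch layer name conv1_1_d are placeholder assumptions, not names from the released files.

```python
# Sketch: initialize the depth encoder's first conv layer by averaging the
# pretrained VGG-16 RGB filters. All file and layer names are placeholders.
import caffe

net = caffe.Net('fusenet_train.prototxt',           # hypothetical FuseNet definition
                'VGG_ILSVRC_16_layers.caffemodel',  # pretrained VGG-16 weights
                caffe.TRAIN)

rgb_w = net.params['conv1_1'][0].data               # shape: (64, 3, 3, 3)
depth_w = rgb_w.mean(axis=1, keepdims=True)         # average over RGB -> (64, 1, 3, 3)

# 'conv1_1_d' is an assumed name for the first conv layer of the depth branch.
net.params['conv1_1_d'][0].data[...] = depth_w
net.params['conv1_1_d'][1].data[...] = net.params['conv1_1'][1].data  # reuse bias
net.save('fusenet_init.caffemodel')
```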

Data preparation

To store the dataset, we save paired RGB-D images into LMDB. We also scale the original depth images to the range [0, 255]. Optionally, you can further cast the scaled depth values to unsigned char (grayscale) to save memory; if you do not want to lose precision, store the scaled depth as float. To prepare the LMDB, we provide the following Python script for your reference. However, you can also write your own image input layer to grab paired RGB-D images.

demo   ./fusenet/scripts/save_lmdb.py
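For orientation, below is a minimal sketch of such an LMDB writer using the Caffe Python bindings; this is not the released save_lmdb.py, and the 4-channel RGB-D packing and the key format are assumptions.

```python
# Sketch: write paired RGB-D images into LMDB as caffe datums (float data,
# so the scaled depth keeps its precision). Packing and keys are assumptions.
import lmdb
import numpy as np
import caffe
from PIL import Image

def save_rgbd_lmdb(db_path, rgb_files, depth_files):
    env = lmdb.open(db_path, map_size=1 << 40)
    with env.begin(write=True) as txn:
        for i, (rgb_f, d_f) in enumerate(zip(rgb_files, depth_files)):
            rgb = np.asarray(Image.open(rgb_f), dtype=np.float32)    # H x W x 3
            depth = np.asarray(Image.open(d_f), dtype=np.float32)    # H x W
            # scale depth to [0, 255]; cast to uint8 here only if the
            # precision loss is acceptable
            depth = 255.0 * (depth - depth.min()) / max(depth.max() - depth.min(), 1e-6)
            rgbd = np.dstack([rgb, depth]).transpose(2, 0, 1)        # 4 x H x W
            datum = caffe.io.array_to_datum(rgbd)
            txn.put('{:08d}'.format(i).encode(), datum.SerializeToString())
    env.close()
```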

LMDB shuffling

We support LMDB shuffling and recommend shuffling after each epoch during training. To enable this option, set the shuffle flag to true for the DataLayer in the prototxt. Note that we do not support shuffling with LevelDB.

demo   ./fusenet/segmentation/nyuv2_sf1/train.prototxt

Weighted cross-entropy loss

One common technique to handle class imbalance is to weight the loss of each class differently, typically with a higher weight for less frequent classes and a lower weight for more frequent ones. For semantic segmentation, we support this loss weighting in the SoftmaxWithLossLayer by allowing the user to specify a weight for each label. One way to set the weights is according to the inverse class frequency (see our paper for details). We provide the weights used in our paper in ./fusenet/data/.
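As a reference for how inverse-class-frequency weights can be derived (the exact weights used in the paper are the ones provided in ./fusenet/data/), a small sketch; the normalization to mean 1 is a design choice of this sketch:

```python
# Sketch: per-class loss weights from inverse class frequency over the
# training label images.
import numpy as np
from PIL import Image

def inverse_frequency_weights(label_files, num_classes):
    counts = np.zeros(num_classes, dtype=np.float64)
    for f in label_files:
        labels = np.asarray(Image.open(f)).ravel()
        counts += np.bincount(labels, minlength=num_classes)[:num_classes]
    freq = counts / counts.sum()
    weights = 1.0 / np.maximum(freq, 1e-12)   # rarer classes get larger weights
    return weights / weights.mean()           # normalize so the mean weight is 1
```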

Batch normalization

We use batch normalization after each convolution. This is supported by the Caffe BatchNormLayer. Note that we add a ScaleLayer after each BatchNormLayer.
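A sketch of this conv → BatchNorm → Scale → ReLU pattern written with Caffe's Python NetSpec; layer names and hyperparameters are illustrative, not taken from the released prototxt files.

```python
# Sketch: convolution followed by BatchNorm and a learnable Scale (gamma/beta),
# since Caffe's BatchNormLayer only normalizes. Parameters are illustrative.
import caffe
from caffe import layers as L

def conv_bn_relu(bottom, num_output):
    conv = L.Convolution(bottom, num_output=num_output, kernel_size=3, pad=1,
                         weight_filler=dict(type='xavier'))
    bn = L.BatchNorm(conv, in_place=True)
    scale = L.Scale(bn, bias_term=True, in_place=True)  # learnable gamma + beta
    return L.ReLU(scale, in_place=True)

n = caffe.NetSpec()
n.data = L.Input(shape=dict(dim=[1, 3, 240, 320]))
n.relu1_1 = conv_bn_relu(n.data, 64)
print(n.to_proto())
```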

Testing

To test semantic segmentation performance, we provide Python scripts to calculate the global accuracy, average class accuracy and average intersection-over-union (IoU) score. The implementation is based on the confusion matrix.

demo   ./fusenet/scripts/test_segmentation.py
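The three scores follow directly from the confusion matrix; a minimal sketch (not the released test_segmentation.py):

```python
# Sketch: global accuracy, average class accuracy and average IoU
# from a confusion matrix where conf[i, j] counts true class i predicted as j.
import numpy as np

def segmentation_scores(conf):
    conf = conf.astype(np.float64)
    tp = np.diag(conf)
    gt = conf.sum(axis=1)                        # TP + FN per class
    pred = conf.sum(axis=0)                      # TP + FP per class
    global_acc = tp.sum() / conf.sum()
    with np.errstate(divide='ignore', invalid='ignore'):
        class_acc = np.nanmean(tp / gt)          # classes absent from GT give NaN
        iou = np.nanmean(tp / (gt + pred - tp))  # and are skipped by nanmean
    return global_acc, class_acc, iou
```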

Released Caffemodel

Semantic Image Segmentation

The items marked with ticks are already available for download; the rest will be released soon. Unless otherwise stated, all models are finetuned from the pretrained VGGNet-16Layer model. Stay tuned 🔥

NYUv2 40-class semantic segmentation

For more information about the dataset, check here.

  • FuseNet-SF1:

    This model is trained with the FuseNet Sparse-Fusion1 (SF1) architecture at 320x240 resolution. To obtain the full 640x480 resolution, you can bilinearly upsample the segmentation (see the sketch after this list) or, better, refine it with a CRF.

  • FuseNet-SF5:

    This model is trained with the FuseNet Sparse-Fusion5 (SF5) architecture at 320x240 resolution. It gives 66.0% global pixelwise accuracy, 43.4% average classwise accuracy and 32.7% average classwise IoU.
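The bilinear upsampling mentioned for FuseNet-SF1 above can be applied to the class score maps before the argmax; a minimal sketch assuming OpenCV and C score maps at 320x240 (CRF refinement is not shown):

```python
# Sketch: bilinearly upsample per-class score maps from 320x240 to 640x480,
# then take the per-pixel argmax to get the full-resolution label map.
import cv2
import numpy as np

def upsample_prediction(scores):             # scores: C x 240 x 320 float array
    up = np.stack([cv2.resize(s, (640, 480), interpolation=cv2.INTER_LINEAR)
                   for s in scores])         # C x 480 x 640
    return up.argmax(axis=0)                 # 480 x 640 label map
```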

SUN-RGBD 37-class semantic segmentation

For more information about the dataset, check here.

  • FuseNet-SF5:

    This model is trained at 224x224 resolution. It gives 76.3% global pixelwise accuracy, 48.3% average classwise accuracy and 37.3% average classwise IoU.

Publication

If you use this code or our trained model in your work, please consider citing the following paper.

Caner Hazirbas, Lingni Ma, Csaba Domokos and Daniel Cremers, "FuseNet: Incorporating Depth into Semantic Segmentation via Fusion-based CNN Architecture", in Proceedings of the 13th Asian Conference on Computer Vision (ACCV), 2016. (pdf)

@inproceedings{fusenet2016accv,
 author    = "C. Hazirbas and L. Ma and C. Domokos and D. Cremers",
 title     = "FuseNet: incorporating depth into semantic segmentation via fusion-based CNN architecture",
 booktitle = "Asian Conference on Computer Vision",
 year      = "2016",
 month     = "November",
}

License and Contact

BVLC/caffe is released under the BSD 2-Clause license. Our modifications to the original code are released under the GNU General Public License Version 3 (GPLv3).

Contact Lingni Ma ✉️ for questions, comments and reporting bugs.

fusenet's People

Contributors

SummerIcequeen


fusenet's Issues

where is the code in script?

@SummerIcequeen @hazirbas
I want to try your code, but I can't find the scripts referred to in the README: ./fusenet/scripts/save_lmdb.py and ./fusenet/scripts/test_segmentation.py. Will you release the code, please?

How to process depth information ?

Hello,
I want to know how the pretrained model for the depth channel is generated. Do the RGB channel and the depth channel use different pretrained models? Thanks.

get the lower performance

Can you release your evaluation code?
I have tried to train the network, but I get lower IoU and accuracy than reported.
Thanks

calculate TP,FP and FN

Hello,
I am faced with a problem: the size of the labeled image in the .mat file is 640×480, while the size of the predicted label image is 320×240. How can I calculate TP, FP and FN using two images of different sizes?

Thanks

about caffemodel

Hi,
I have found download_model_from_gist.sh in the "scripts" folder, however I could not find the gist_id.
So I wonder: have the caffemodels been shared? And where can I find the prototxt that describes the hyper-parameters?
Thanks.

Problem about dims in training

When I train FuseNet, I get the following error:

F0302 14:54:53.849881  9560 blob.cpp:32] Check failed: shape[i] >= 0 (-1 vs. 0) 
*** Check failure stack trace: ***
    @     0x7f208aedae3d  google::LogMessage::Fail()
    @     0x7f208aedcbc0  google::LogMessage::SendToLog()
    @     0x7f208aedaa23  google::LogMessage::Flush()
    @     0x7f208aedd58e  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f208b4b662b  caffe::Blob<>::Reshape()
    @     0x7f208b578673  caffe::SliceLayer<>::Reshape()
    @     0x7f208b5e8d96  caffe::Net<>::Init()
    @     0x7f208b5ea5e1  caffe::Net<>::Net()
    @     0x7f208b432c0a  caffe::Solver<>::InitTrainNet()
I0302 14:54:53.852614  9567 db_lmdb.hpp:53] shuffle data, start epoch 3
    @     0x7f208b434017  caffe::Solver<>::Init()
    @     0x7f208b4343ba  caffe::Solver<>::Solver()
    @     0x7f208b4a1e83  caffe::Creator_SGDSolver<>()
    @           0x40b978  train()
    @           0x408618  main
    @     0x7f208992d830  (unknown)
    @           0x408d89  _start
Aborted (core dumped)

I suspect the problem is the blob dimensions of the net, so I printed the blob sizes:

top size : 3
/home/jason/Jason/Code/Semantic_Segmentation/fusenet/src/caffe/layers/slice_layer.cpp, 56
2, 3, 480, 640
/home/jason/Jason/Code/Semantic_Segmentation/fusenet/src/caffe/layers/slice_layer.cpp, 56
2, 1, 480, 640
/home/jason/Jason/Code/Semantic_Segmentation/fusenet/src/caffe/layers/slice_layer.cpp, 56
2, -1, 480, 640

So I think maybe I organized the data in the wrong way. I use the following list file to make the LMDB dataset, ordering color images, then depth images, then label images, using the default convert_imageset tool offered by Caffe:

00002_colors.png
00003_colors.png
...
00002_depth.png
00003_depth.png
...
00002_label.png
00003_label.png

And I wonder: both the label image and the depth image are single-channel, so how can the net tell them apart?

In the meantime, I do not clearly know how to use the weighted cross-entropy loss; can you help me?

Thanks all the same.

Predict using C++

Hello,
When I use the trained Caffe model to predict with C++ code adapted from the "cpp_classification" example in the Caffe project, the output of the program is a blob whose values are very small, like 0.0009. Are there some lines that need to be corrected?
Thanks.

```cpp
#include <caffe/caffe.hpp>
#ifdef USE_OPENCV
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#endif // USE_OPENCV
// headers as in the cpp_classification example (angle-bracket contents
// were stripped by the page; these are the example's standard includes)
#include <algorithm>
#include <iosfwd>
#include <memory>
#include <string>
#include <utility>
#include <vector>

#ifdef USE_OPENCV
using namespace caffe; // NOLINT(build/namespaces)
using std::string;

class MyClassifier {
public:
MyClassifier(const string& model_file,
const string& trained_file,
const string& mean_file);
cv::Mat Predict(const cv::Mat& rgbImg, const cv::Mat& depthImg);
private:
void SetMean(const string& mean_file);
void WrapInputLayer(std::vector<cv::Mat>* input_channels);
void Preprocess(const cv::Mat& rgbImg, const cv::Mat& depthImg, std::vector<cv::Mat>* input_channels);

private:
shared_ptr<Net<float> > net_;
cv::Size input_geometry_rgb;
cv::Size input_geometry_depth;
int num_channels_rgb;
int num_channels_depth;
cv::Mat mean_rgb;
cv::Mat mean_depth;
};

MyClassifier::MyClassifier(const string& model_file,
const string& trained_file,
const string& mean_file) {
#ifdef CPU_ONLY
Caffe::set_mode(Caffe::CPU);
#else
Caffe::set_mode(Caffe::GPU);
#endif

/* Load the network. */
net_.reset(new Net<float>(model_file, TEST));
net_->CopyTrainedLayersFrom(trained_file);

CHECK_EQ(net_->num_inputs(), 2) << "Network should have exactly two input.";
CHECK_EQ(net_->num_outputs(), 1) << "Network should have exactly one output.";
Blob<float>* input_layer_rgb = net_->input_blobs()[0];
Blob<float>* input_layer_depth = net_->input_blobs()[1];
num_channels_rgb = input_layer_rgb->channels();
num_channels_depth = input_layer_depth->channels();
CHECK(num_channels_rgb == 3)
    << "RGB Input layer should have 3 channels.";
CHECK(num_channels_depth == 1)
<< "Depth Input layer should have 1 channels.";
input_geometry_rgb = cv::Size(input_layer_rgb->width(), input_layer_rgb->height());
input_geometry_depth = cv::Size(input_layer_depth->width(), input_layer_depth->height());

/* Load the binaryproto mean file. */
SetMean(mean_file);
}

/* Load the mean file in binaryproto format. */
void MyClassifier::SetMean(const string& mean_file) {
BlobProto blob_proto;
ReadProtoFromBinaryFileOrDie(mean_file.c_str(), &blob_proto);

/* Convert from BlobProto to Blob<float> */
Blob<float> mean_blob;
mean_blob.FromProto(blob_proto);
//mean prototxt includes 1 label channel: should subtract 1
CHECK_EQ(mean_blob.channels()-1, (num_channels_rgb+num_channels_depth))
    << "Number of channels of mean file doesn't match input layer.";

/* The format of the mean file is planar float data: BGR + grayscale. */
std::vector<cv::Mat> channels_rgb;
std::vector<cv::Mat> channels_depth;
float* data = mean_blob.mutable_cpu_data();
for (int i = 0; i < (num_channels_rgb+num_channels_depth); ++i) {
    /* Extract an individual channel. */
    cv::Mat channel(mean_blob.height(), mean_blob.width(), CV_32FC1, data);  // mean blob data is float
    if(i<num_channels_rgb) {
        channels_rgb.push_back(channel);
    }else {
        channels_depth.push_back(channel);
    }
    data += mean_blob.height() * mean_blob.width();
}

/* Merge the separate channels into a single image. */
cv::Mat meanRgb,meanDepth;
cv::merge(channels_rgb, meanRgb);
cv::merge(channels_depth, meanDepth);

/* Compute the global mean pixel value and create a mean image
* filled with this value. */
cv::Scalar channel_mean_rgb = cv::mean(meanRgb);
cv::Scalar channel_mean_depth = cv::mean(meanDepth);
mean_rgb = cv::Mat(input_geometry_rgb, meanRgb.type(), channel_mean_rgb);
mean_depth = cv::Mat(input_geometry_depth, meanDepth.type(), channel_mean_depth);

}

cv::Mat MyClassifier::Predict(const cv::Mat& rgbImg, const cv::Mat& depthImg) {
Blob<float>* input_layer_rgb = net_->input_blobs()[0];
Blob<float>* input_layer_depth = net_->input_blobs()[1];

input_layer_rgb->Reshape(1, num_channels_rgb,
                   input_geometry_rgb.height, input_geometry_rgb.width);
input_layer_depth->Reshape(1, num_channels_depth,
                         input_geometry_depth.height, input_geometry_depth.width);
/* Forward dimension change to all layers. */
net_->Reshape();

std::vector<cv::Mat> input_channels;
WrapInputLayer(&input_channels);

Preprocess(rgbImg,depthImg, &input_channels);

net_->Forward();

/* Copy the output layer to a std::vector */
Blob<float>* output_layer = net_->output_blobs()[0];
const float * result=output_layer->cpu_data();
cv::Mat resultMat(output_layer->height(),output_layer->width(), CV_32FC1,(float *)result);
cv::Mat resultMatUC;
return resultMat;

}

void MyClassifier::WrapInputLayer(std::vector<cv::Mat>* input_channels) {
Blob<float>* input_layer_rgb = net_->input_blobs()[0];
Blob<float>* input_layer_depth = net_->input_blobs()[1];

int width = input_layer_rgb->width();
int height = input_layer_rgb->height();
float* input_data = input_layer_rgb->mutable_cpu_data();
for (int i = 0; i < input_layer_rgb->channels(); ++i) {
    cv::Mat channel(height, width, CV_32FC1, input_data);  // wrap one float channel of the blob
    input_channels->push_back(channel);
    input_data += width * height;
}
width = input_layer_depth->width();
height = input_layer_depth->height();
input_data = input_layer_depth->mutable_cpu_data();
for (int i = 0; i < input_layer_depth->channels(); ++i) {
    cv::Mat channel(height, width, CV_32FC1, input_data);  // wrap one float channel of the blob
    input_channels->push_back(channel);
    input_data += width * height;
}

}

void MyClassifier::Preprocess(const cv::Mat& rgbImg, const cv::Mat& depthImg,
std::vector<cv::Mat>* input_channels) {
/* Convert the input image to the input image format of the network. */
cv::Mat sample_rgb,sample_depth;
if (rgbImg.channels() == 4)
cv::cvtColor(rgbImg, sample_rgb, cv::COLOR_BGRA2BGR);
else if (rgbImg.channels() == 1 )
cv::cvtColor(rgbImg, sample_rgb, cv::COLOR_GRAY2BGR);
else
sample_rgb = rgbImg;
//depth image
if (depthImg.channels() == 4)
cv::cvtColor(depthImg, sample_depth, cv::COLOR_BGRA2GRAY);
else if (depthImg.channels() == 3 )
cv::cvtColor(depthImg, sample_depth, cv::COLOR_BGR2GRAY);
else if (depthImg.channels() == 1){
if(depthImg.type()!=CV_8UC1){
depthImg.convertTo(sample_depth,CV_8UC1,1.0/255);
}else {
sample_depth = depthImg;
}
}

cv::Mat sample_resized_rgb;
if (sample_rgb.size() != input_geometry_rgb)
    cv::resize(sample_rgb, sample_resized_rgb, input_geometry_rgb);
else
    sample_resized_rgb = sample_rgb;

cv::Mat sample_resized_depth;
if (sample_depth.size() != input_geometry_depth)
    cv::resize(sample_depth, sample_resized_depth, input_geometry_depth);
else
    sample_resized_depth = sample_depth;


/* Convert to float before the mean subtraction (the mean Mats are float). */
cv::Mat sample_float_rgb, sample_float_depth;
sample_resized_rgb.convertTo(sample_float_rgb, CV_32FC3);
sample_resized_depth.convertTo(sample_float_depth, CV_32FC1);

cv::Mat sample_normalized_rgb, sample_normalized_depth;
cv::subtract(sample_float_rgb, mean_rgb, sample_normalized_rgb);
cv::subtract(sample_float_depth, mean_depth, sample_normalized_depth);

/* This operation writes the separate BGR planes directly to the RGB input
 * blob of the network, because they are wrapped by the cv::Mat objects
 * in input_channels. */
cv::Mat depth_wrap = (*input_channels)[num_channels_rgb];
cv::split(sample_normalized_rgb, *input_channels);
/* The depth plane must be copied into the wrapped blob memory; push_back
 * would append a Mat that is not backed by the depth input blob. */
sample_normalized_depth.copyTo(depth_wrap);

}

int main(int argc, char** argv) {
// if (argc != 6) {
// std::cerr << "Usage: " << argv[0]
// << " deploy.prototxt network.caffemodel"
// << " mean.binaryproto img.jpg depth.png" << std::endl;
// return 1;
// }

::google::InitGoogleLogging(argv[0]);

// string model_file = argv[1];//deploy.prototxt
// string trained_file = argv[2];//caffemodel
// string mean_file = argv[3];//mean.prototxt
// string file = argv[4];
// string fileDepth = argv[5];//mean.prototxt

string model_file="segmentation/nyu40-sf1/deploy.prototxt";
string trained_file="caffemodels_iter_75000.caffemodel";
string mean_file="db/mean/mean.prototxt";
string file="NYU0015/fullres/NYU0015.jpg";
string fileDepth="NYU0015/fullres/NYU0015.png";

MyClassifier classifier(model_file, trained_file, mean_file);


std::cout << "---------- Prediction for "
        << file << " ----------" << std::endl;

cv::Mat img = cv::imread(file, -1);
cv::Mat imgDepth = cv::imread(fileDepth, -1);
CHECK(!img.empty()) << "Unable to decode image " << file;
cv::Mat result=classifier.Predict(img,imgDepth);

// std::cout<<result.type() <<std::endl;
// std::cout<<result.cols <<std::endl;
// std::cout<<result.rows <<std::endl;
}
#else
int main(int argc, char** argv) {
LOG(FATAL) << "This example requires OpenCV; compile with USE_OPENCV.";
}
#endif // USE_OPENCV
```

Problem about preparing LMDB

Hi.
Could you provide the script for preparing the LMDB (./fusenet/scripts/save_lmdb.py)? The default Caffe LMDB writer does not accept 4-D input images.
Thanks very much!

Problem about LMDB with C++

Hello,
I am not sure whether this is the right way to create an LMDB file to train FuseNet:
I wrote a program to convert images to LMDB using C++ and OpenCV, and I use the NYU part of the SUNRGBD dataset. First I load an RGB image into a cv::Mat with 3 channels, load the corresponding depth file into a cv::Mat with 1 channel (char), and create a label cv::Mat with 1 channel; all the cv::Mats have the same size. Then I use cv::merge to combine the three cv::Mats into a new cv::Mat with 5 channels. Finally I convert the new cv::Mat to a datum and write it into the LMDB file.
If that is the right way, I face another problem: how do I find the right values to fill the label cv::Mat? What I do is parse the index.json file corresponding to each RGB image to get an array of polygons, take the x and y values of each polygon element to create 2D points, and then use drawContours (an OpenCV function) to fill the area of each object in a mask Mat created for each polygon. The value filled into each area is the index of the corresponding label in classes.txt (I use the 40 classes). Finally I add all the masks into one mask, which is the label cv::Mat for the image. Is that right?
Hope for your response.

Depth normalization

Hi @tum-vision @hazirbas
In the paper, it says that you normalize the depth map to [0, 255].
Does this mean mapping the 16-bit range [0, 65535] to [0, 255], or mapping the [min, max] of each depth map to [0, 255]?
By the way, did you also normalize when training with depth only?

Thank you

Try to find save_lmdb.py

Hi, I cannot find the folder called scripts in fusenet mentioned in the description.
Is there any way I can find it?
Thank you.
