gpleiss / efficient_densenet_pytorch

A memory-efficient implementation of DenseNets

License: MIT License

Python 100.00%

densenet pytorch deep-learning

efficient_densenet_pytorch

A PyTorch >=1.0 implementation of DenseNets, optimized to save GPU memory.

Recent updates

Now works on PyTorch 1.0! It uses the checkpointing feature, which makes this code WAY more efficient!!!

Motivation

While DenseNets are fairly easy to implement in deep learning frameworks, most implementations (such as the original) tend to be memory-hungry. In particular, the number of intermediate feature maps generated by batch normalization and concatenation operations grows quadratically with network depth. It is worth emphasizing that this is not a property inherent to DenseNets, but rather to the implementation.

This implementation uses a new strategy to reduce the memory consumption of DenseNets. We use checkpointing to compute the Batch Norm and concatenation feature maps. These intermediate feature maps are discarded during the forward pass and recomputed for the backward pass. This adds 15-20% of time overhead for training, but reduces feature map consumption from quadratic to linear.
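
To make the quadratic-versus-linear claim concrete, here is a rough channel-counting sketch for a single dense block. It is a simplified model, not the repository's code: the helper name `stored_channels`, the block sizes, and the assumption that a naive implementation keeps exactly two concatenated-width maps per layer (the concatenation and its Batch Norm output) are illustrative.

```python
def stored_channels(num_layers, growth_rate, init_channels, efficient):
    """Count feature-map channels kept for the backward pass in one dense block.

    Simplified model (an assumption, not measured from the repository):
    every layer's own output is always kept; a naive implementation also
    keeps each layer's concatenated input and its batch-normalized copy.
    """
    # Block input plus one growth_rate-wide output per layer: linear in depth.
    total = init_channels + num_layers * growth_rate
    if not efficient:
        for layer in range(num_layers):
            # Width of the concatenation fed into this layer.
            concat_width = init_channels + layer * growth_rate
            # Concatenation output + Batch Norm output: quadratic in depth overall.
            total += 2 * concat_width
    return total


for depth in (16, 32):
    naive = stored_channels(depth, growth_rate=12, init_channels=24, efficient=False)
    efficient = stored_channels(depth, growth_rate=12, init_channels=24, efficient=True)
    print(depth, naive, efficient)
# 16 layers: 3864 naive vs 216 efficient
# 32 layers: 13848 naive vs 408 efficient
```

Doubling the number of layers roughly doubles the efficient count but multiplies the naive count by about 3.6, which is the quadratic growth described above.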

This implementation is inspired by the technical report cited in the Reference section below, which outlines a strategy for efficient DenseNets via memory sharing.

Requirements

PyTorch >=1.0.0
CUDA

Usage

In your existing project: There is one file in the models folder.

models/densenet.py is an implementation based off the torchvision and project killer implementations.

If you care about speed and memory is not a concern, pass the efficient=False argument into the DenseNet constructor. Otherwise, pass in efficient=True.

Options:

All options are described in the docstrings of the model files
The depth is controlled by the block_config option
efficient=True uses the memory-efficient version
If you want to use the model for ImageNet, set small_inputs=False. For CIFAR or SVHN, set small_inputs=True.

Running the demo:

The only extra package you need to install is python-fire:

pip install fire

Single GPU:

CUDA_VISIBLE_DEVICES=0 python demo.py --efficient True --data <path_to_folder_with_cifar10> --save <path_to_save_dir>

Multiple GPU:

CUDA_VISIBLE_DEVICES=0,1,2 python demo.py --efficient True --data <path_to_folder_with_cifar10> --save <path_to_save_dir>

Options:

--depth (int) - depth of the network (number of convolution layers) (default 40)
--growth_rate (int) - number of features added per DenseNet layer (default 12)
--n_epochs (int) - number of epochs for training (default 300)
--batch_size (int) - size of minibatch (default 256)
--seed (int) - manually set the random seed (default None)
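
The relationship between --depth and the model's block_config can be sketched as follows. This is an assumption about the usual DenseNet-BC layer accounting (three dense blocks, two convolutions per dense layer, four layers outside the blocks), not code taken from demo.py; `block_config_for_depth` is a hypothetical helper name.

```python
def block_config_for_depth(depth):
    """Hypothetical helper: dense layers per block for a 3-block DenseNet-BC.

    Assumption: 4 layers sit outside the dense blocks (first convolution,
    two transition convolutions, final classifier) and every dense layer
    contains 2 convolutions (1x1 bottleneck followed by 3x3).
    """
    convs_in_blocks = depth - 4
    if convs_in_blocks <= 0 or convs_in_blocks % 6 != 0:
        raise ValueError("depth must be of the form 6 * n + 4, got %d" % depth)
    layers_per_block = convs_in_blocks // 6
    return (layers_per_block,) * 3


print(block_config_for_depth(100))  # (16, 16, 16)
print(block_config_for_depth(190))  # (31, 31, 31)
```

Under these assumptions a depth that is not of the form 6n + 4 has no valid three-block configuration, so the helper raises instead of rounding silently.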

Performance

A comparison of the two implementations (each is a DenseNet-BC with 100 layers, batch size 64, tested on a NVIDIA Pascal Titan-X):

Implementation	Memory consumption (GB/GPU)	Speed (sec/mini batch)
Naive	2.863	0.165
Efficient	1.605	0.207
Efficient (multi-GPU)	0.985	-
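
The relative trade-off implied by the table can be worked out directly. The snippet below only divides the table's own numbers; the variable names are illustrative.

```python
# Values copied from the table above (single GPU rows).
naive_memory_gb, efficient_memory_gb = 2.863, 1.605
naive_sec, efficient_sec = 0.165, 0.207

# Fraction of memory saved by the efficient implementation.
memory_saving = 1 - efficient_memory_gb / naive_memory_gb
# Extra time per mini batch, relative to the naive implementation.
time_overhead = efficient_sec / naive_sec - 1

print(f"memory saved:  {memory_saving:.1%}")   # 43.9%
print(f"time overhead: {time_overhead:.1%}")   # 25.5%
```

For this particular configuration the measured overhead comes out somewhat above the 15-20% range quoted in the Motivation section.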

Other efficient implementations

LuaTorch (by Gao Huang)
Tensorflow (by Joe Yearsley)
Caffe (by Tongcheng Li)

Reference

@article{pleiss2017memory,
  title={Memory-Efficient Implementation of DenseNets},
  author={Pleiss, Geoff and Chen, Danlu and Huang, Gao and Li, Tongcheng and van der Maaten, Laurens and Weinberger, Kilian Q},
  journal={arXiv preprint arXiv:1707.06990},
  year={2017}
}

efficient_densenet_pytorch's Issues

Cannot train when using a 256*256 dataset

I can run the network using CIFAR-10 as my dataset.
However, when I use my own dataset, whose images are 256*256, it does not work.
When I resize my data to 32*32, it works as well.
So how can I solve the scaling problem?

Multi-GPU model in pytorch0.3 consumes much more memory than pytorch0.1 version

Just tried the new implementation in pytorch0.3, but it consumes much more memory than old implementation. Some issues:

when the model runs on a single GPU, it still allocates shared storage on all the GPUs. I think the for device_idx in range(torch.cuda.device_count()) loop in _SharedAllocation() requires some modification and optimization.
when the model runs on multiple GPUs, the batch size it can afford is much less than the single-GPU batch size times the number of GPUs. From my test it can only afford the same batch size as the single-GPU version.

I hit this problem when running demo.py. How can I solve it?

0.1
Training
Traceback (most recent call last):
  File "demo.py", line 272, in <module>
    fire.Fire(demo)
  File "/home/yyj/anaconda2/lib/python2.7/site-packages/fire/core.py", line 127, in Fire
    component_trace = _Fire(component, args, context, name)
  File "/home/yyj/anaconda2/lib/python2.7/site-packages/fire/core.py", line 366, in _Fire
    component, remaining_args)
  File "/home/yyj/anaconda2/lib/python2.7/site-packages/fire/core.py", line 542, in _CallCallable
    result = fn(*varargs, **kwargs)
  File "demo.py", line 250, in demo
    n_epochs=n_epochs, batch_size=batch_size, seed=seed)
  File "demo.py", line 166, in train
    train=True,
  File "demo.py", line 101, in run_epoch
    output_var = model(input_var)
  File "/home/yyj/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yyj/Downloads/efficient_densenet_pytorch-master/models/densenet_efficient.py", line 218, in forward
    features = self.features(x)
  File "/home/yyj/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yyj/anaconda2/lib/python2.7/site-packages/torch/nn/modules/container.py", line 67, in forward
    input = module(input)
  File "/home/yyj/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yyj/Downloads/efficient_densenet_pytorch-master/models/densenet_efficient.py", line 152, in forward
    outputs.append(module.forward(outputs))
  File "/home/yyj/Downloads/efficient_densenet_pytorch-master/models/densenet_efficient.py", line 107, in forward
    new_features = super(_DenseLayer, self).forward(prev_features)
  File "/home/yyj/anaconda2/lib/python2.7/site-packages/torch/nn/modules/container.py", line 67, in forward
    input = module(input)
  File "/home/yyj/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yyj/Downloads/efficient_densenet_pytorch-master/models/densenet_efficient.py", line 85, in forward
    return fn(self.norm_weight, self.norm_bias, self.conv_weight, *inputs)
  File "/home/yyj/Downloads/efficient_densenet_pytorch-master/models/densenet_efficient.py", line 265, in forward
    conv_output = self.efficient_conv.forward(conv_weight, None, relu_output)
  File "/home/yyj/Downloads/efficient_densenet_pytorch-master/models/densenet_efficient.py", line 457, in forward
    self.groups, cudnn.benchmark
TypeError: _cudnn_convolution_full_forward received an invalid combination of arguments - got (torch.cuda.FloatTensor, torch.cuda.FloatTensor, NoneType, torch.cuda.FloatTensor, tuple, tuple, tuple, int, bool), but expected (torch.cuda.RealTensor input, torch.cuda.RealTensor weight, torch.cuda.RealTensor bias, torch.cuda.RealTensor output, std::vector pad, std::vector stride, std::vector dilation, int groups, bool benchmark, bool deterministic)

Cannot reproduce the cifar100 results using models/densenet.py (not efficient)

About mean and stdv

Hi,
In demo.py#L212, the mean value and stdv value are given directly:

mean = [0.5071, 0.4867, 0.4408] 
stdv = [0.2675, 0.2565, 0.2761]

but when I use the compute-cifar10-mean.py to calculate them, I get the result as follows:

means: [0.53129727, 0.52593911, 0.52069134]
stdevs: [0.28938246, 0.28505746, 0.27971658]

These two results are clearly different. Can you tell me how the mean and stdv in the original demo were calculated?

Thanks!
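
For reference, one straightforward way to compute per-channel statistics is sketched below using only the standard library. It pools every pixel of every image and scales values to [0, 1]; whether the constants in demo.py were produced this way is not stated in the source, and `channel_stats` is a hypothetical helper name.

```python
from statistics import fmean, pstdev


def channel_stats(images):
    """Per-channel mean and population standard deviation over all pixels.

    `images` is a list of images, each a list of (r, g, b) tuples with
    integer values in 0..255. Values are scaled to [0, 1] first.
    """
    means, stdevs = [], []
    for channel in range(3):
        # Pool this channel's values across every pixel of every image.
        values = [pixel[channel] / 255.0 for image in images for pixel in image]
        means.append(fmean(values))
        stdevs.append(pstdev(values))
    return means, stdevs


# Tiny illustrative input: two images of two pixels each.
toy_images = [[(0, 0, 0), (255, 255, 255)], [(255, 0, 0), (255, 0, 0)]]
print(channel_stats(toy_images))
```

Results differ depending on whether statistics are pooled over all pixels (as here) or averaged per image, and on which dataset split is used, so two scripts can legitimately disagree.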

TabError: inconsistent use of tabs and spaces in indentation

flake8 testing of https://github.com/gpleiss/efficient_densenet_pytorch on Python 3.6.2

$ flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics

./models/densenet.py:15:66: E999 TabError: inconsistent use of tabs and spaces in indentation
	self.add_module('conv.1', nn.Conv2d(num_input_features, bn_size *
                                                                 ^

./models/densenet_efficient.py:404:15: E999 TabError: inconsistent use of tabs and spaces in indentation
	output = input
              ^

The final test accuracy

Hi, Thanks for this implementation ! I'm wondering how to obtain the quite strong test set result on CIFAR-10, as reported in the original densenet paper (e.g., error rate <=3.5 on C-10+, with depth =190, growth_rate = 40). When I run the script as:

CUDA_VISIBLE_DEVICES=0,1,2,3 python demo.py --depth 190 --efficient False --data ./data --save ./ckpts

The final test error is reported as 0.0535. I'm wondering whether the high error is because no data augmentation is applied in the default setting. May I know whether it is the C10+ dataset or C10?

Best

test failed on v0.2

the efficient_densenet_bottleneck_test.py failed in test_backward_computes_backward_pass

>       assert(almost_equal(layer.conv.weight.grad.data, layer_efficient.conv_weight.grad.data))
E       assert False
E        +  where False = almost_equal(\n(0 ,0 ,.,.) = \n    0.3746\n\n(0 ,1 ,.,.) = \n   70.7402\n\n(0 ,2 ,.,.) = \n   68.3647\n\n(0 ,3 ,.,.) = \n    5.2501\n\n(0 ,4 ,.,...) = \n  101.7459\n\n(3 ,6 ,.,.) = \n   10.9038\n\n(3 ,7 ,.,.) = \n    0.0000\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n, \n(0 ,0 ,.,.) = \n  0.0000e+00\n\n(0 ,1 ,.,.) = \n -2.0594e+24\n\n(0 ,2 ,.,.) = \n -9.6653e+20\n\n(0 ,3 ,.,.) = \n  2.1138e+21\n\n(...-1.5375e+00\n\n(3 ,6 ,.,.) = \n -7.0127e-03\n\n(3 ,7 ,.,.) = \n  0.0000e+00\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n)
E        +    where \n(0 ,0 ,.,.) = \n    0.3746\n\n(0 ,1 ,.,.) = \n   70.7402\n\n(0 ,2 ,.,.) = \n   68.3647\n\n(0 ,3 ,.,.) = \n    5.2501\n\n(0 ,4 ,.,...) = \n  101.7459\n\n(3 ,6 ,.,.) = \n   10.9038\n\n(3 ,7 ,.,.) = \n    0.0000\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n = Variable containing:\n(0 ,0 ,.,.) = \n    0.3746\n\n(0 ,1 ,.,.) = \n   70.7402\n\n(0 ,2 ,.,.) = \n   68.3647\n\n(0 ,3 ,.,.) = \n ...) = \n  101.7459\n\n(3 ,6 ,.,.) = \n   10.9038\n\n(3 ,7 ,.,.) = \n    0.0000\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n.data
E        +      where Variable containing:\n(0 ,0 ,.,.) = \n    0.3746\n\n(0 ,1 ,.,.) = \n   70.7402\n\n(0 ,2 ,.,.) = \n   68.3647\n\n(0 ,3 ,.,.) = \n ...) = \n  101.7459\n\n(3 ,6 ,.,.) = \n   10.9038\n\n(3 ,7 ,.,.) = \n    0.0000\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n = Parameter containing:\n(0 ,0 ,.,.) = \n  0.0978\n\n(0 ,1 ,.,.) = \n  1.9624\n\n(0 ,2 ,.,.) = \n  2.4802\n\n(0 ,3 ,.,.) = \n  1.06...5 ,.,.) = \n  0.4832\n\n(3 ,6 ,.,.) = \n  1.0052\n\n(3 ,7 ,.,.) = \n  1.7624\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n.grad
E        +        where Parameter containing:\n(0 ,0 ,.,.) = \n  0.0978\n\n(0 ,1 ,.,.) = \n  1.9624\n\n(0 ,2 ,.,.) = \n  2.4802\n\n(0 ,3 ,.,.) = \n  1.06...5 ,.,.) = \n  0.4832\n\n(3 ,6 ,.,.) = \n  1.0052\n\n(3 ,7 ,.,.) = \n  1.7624\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n = Conv2d(8, 4, kernel_size=(1, 1), stride=(1, 1), bias=False).weight
E        +          where Conv2d(8, 4, kernel_size=(1, 1), stride=(1, 1), bias=False) = Sequential (\n  (norm): BatchNorm2d(8, eps=1e-05, momentum=0.1, affine=True)\n  (relu): ReLU (inplace)\n  (conv): Conv2d(8, 4, kernel_size=(1, 1), stride=(1, 1), bias=False)\n).conv
E        +    and   \n(0 ,0 ,.,.) = \n  0.0000e+00\n\n(0 ,1 ,.,.) = \n -2.0594e+24\n\n(0 ,2 ,.,.) = \n -9.6653e+20\n\n(0 ,3 ,.,.) = \n  2.1138e+21\n\n(...-1.5375e+00\n\n(3 ,6 ,.,.) = \n -7.0127e-03\n\n(3 ,7 ,.,.) = \n  0.0000e+00\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n = Variable containing:\n(0 ,0 ,.,.) = \n  0.0000e+00\n\n(0 ,1 ,.,.) = \n -2.0594e+24\n\n(0 ,2 ,.,.) = \n -9.6653e+20\n\n(0 ,3 ,.,....-1.5375e+00\n\n(3 ,6 ,.,.) = \n -7.0127e-03\n\n(3 ,7 ,.,.) = \n  0.0000e+00\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n.data
E        +      where Variable containing:\n(0 ,0 ,.,.) = \n  0.0000e+00\n\n(0 ,1 ,.,.) = \n -2.0594e+24\n\n(0 ,2 ,.,.) = \n -9.6653e+20\n\n(0 ,3 ,.,....-1.5375e+00\n\n(3 ,6 ,.,.) = \n -7.0127e-03\n\n(3 ,7 ,.,.) = \n  0.0000e+00\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n = Parameter containing:\n(0 ,0 ,.,.) = \n  0.0978\n\n(0 ,1 ,.,.) = \n  1.9624\n\n(0 ,2 ,.,.) = \n  2.4802\n\n(0 ,3 ,.,.) = \n  1.06...5 ,.,.) = \n  0.4832\n\n(3 ,6 ,.,.) = \n  1.0052\n\n(3 ,7 ,.,.) = \n  1.7624\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n.grad
E        +        where Parameter containing:\n(0 ,0 ,.,.) = \n  0.0978\n\n(0 ,1 ,.,.) = \n  1.9624\n\n(0 ,2 ,.,.) = \n  2.4802\n\n(0 ,3 ,.,.) = \n  1.06...5 ,.,.) = \n  0.4832\n\n(3 ,6 ,.,.) = \n  1.0052\n\n(3 ,7 ,.,.) = \n  1.7624\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n = _EfficientDensenetBottleneck (\n).conv_weight

I uncommented the code in densenet_efficient.py

self.efficient_batch_norm.training = False,

but the issue persists.

Does not work in a Python 3 environment

Hi, it works in a Python 2 environment, but fails in Python 3.

*** Error in `/home/tengbq/.virtualenvs/py3/bin/python': free(): invalid next size (fast): 0x00007f5f8d34c3b0 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f640b7e57e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x7fe0a)[0x7f640b7ede0a]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7f640b7f198c]
/usr/local/cuda-8.0/lib64/libcudnn.so.6(cudnnDestroyConvolutionDescriptor+0x9)[0x7f63772f4c69]
/home/tengbq/.virtualenvs/py3/local/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so(+0x2dfe17)[0x7f6364044e17]
/home/tengbq/.virtualenvs/py3/local/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so(_ZN5torch5cudnn30cudnn_convolution_full_forwardEP8THCStateP12cudnnContext15cudnnDataType_tPNS_12THVoidTensorES7_S7_S7_St6vectorIiSaIiEESA_SA_ibb+0x6a4)[0x7f6364f16834]
/home/tengbq/.virtualenvs/py3/local/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so(_ZN5torch8autograd11ConvForward5applyERKSt6vectorINS0_8VariableESaIS3_EE+0x1192)[0x7f6364287712]
/home/tengbq/.virtualenvs/py3/local/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so(+0x410c7e)[0x7f6364175c7e]
/home/tengbq/.virtualenvs/py3/bin/python(_PyObject_FastCallDict+0x8b)[0x55a279d841bb]
/home/tengbq/.virtualenvs/py3/bin/python(+0x19cd3e)[0x55a279e11d3e]
/home/tengbq/.virtualenvs/py3/bin/python(_PyEval_EvalFrameDefault+0x30a)[0x55a279e3619a]
/home/tengbq/.virtualenvs/py3/bin/python(+0x1959a6)[0x55a279e0a9a6]
/home/tengbq/.virtualenvs/py3/bin/python(+0x196a11)[0x55a279e0ba11]

How to implement a memory-efficient DenseNet using TensorFlow?

Is there an expert who can convert the PyTorch implementation to TensorFlow?

Can we test using the trained model?

Is there any option to run ImageNet in this demo?

I checked the small_inputs parameter in model however couldn't find the loading options in demo code.

Which versions of torchvision, project killer and python-fire are required?

Could you tell me the versions? Thanks.

questions about code in file "densenet_efficient.py"

Hi,

Thanks for sharing the code. I read your code and have two small questions.

Function "recompute_forward" in class "_EfficientBatchNorm" seems to be unused. Is it redundant?
"training = self.efficient_batch_norm.training" in line 266 and "self.efficient_batch_norm.training = training" in line 295 seems to be unnecessary. I think self.efficient_batch_norm.training should be true in both the first forward and the second forward when computing the gradients.

EfficientReLU

maybe it's not the best place to ask this, but I thought I would be able to get some insight from the author directly :)
You used _EfficientReLU where you called the backend operations, is this necessary? What is the gain here, could I simply substitute it with nn.ReLU?
My intention is to change the type of activation here, wonder whether i can do it in a simpler way.

storage resize_ function

I try to use torch.storage in my network. I use PyTorch 0.3.

if self.storage.size() < size:
    is_cuda = self.storage.is_cuda
    if is_cuda:
        gpu_ID = self.storage.get_device()
        print('gpu_ID1:', gpu_ID)
    self.storage.resize_(size)

    gpu_ID = self.storage.get_device()
    print('gpu_ID2:', gpu_ID)

    if is_cuda:
        self.storage = self.storage.cuda(gpu_ID)
    gpu_ID = self.storage.get_device()
    print('gpu_ID3:', gpu_ID)

The output is

gpu_ID1: 1
gpu_ID2: 0
gpu_ID3: 0

self.storage comes from self.storage = torch.Storage(1024).
It seems that resize_ changes the GPU on which the storage lives.
I want the storage to stay on GPU 1 rather than GPU 0.
How can I do that?

Broken link - MxNet implementation

Hi. The link for the MxNet implementation provided in README is broken.

multi-GPU version slower with more GPUs

Hi, thanks for the implementation.

I find one weird thing using the multi-GPU option. I am trying to replicate the DenseNet-190-40 model which in the DenseNet paper produced the best result on CIFAR dataset. Using a minibatch size of 256, I can train with >= 4 GPUs but not with 2 GPUs, which indicates the multi-GPU version is working properly. However, I also find that per minibatch training time is about 30% slower with 8 GPUs vs. 4 GPUs. Is this expected and do you know what is slowing down the training?

Thanks!

Test failed on PyTorch 0.3.1 with CUDA 9.0

AssertionError in test_forward_training_true_computes_forward_pass:
assert almost_equal(layer.norm.running_mean, layer_efficient.norm_running_mean)
assert almost_equal(layer.norm.running_var, layer_efficient.norm_running_var)

layer.norm.running_mean =
0.2516
0.0036
-0.6237
0.2686
-1.1193
1.2112
-0.0139
0.0237
[torch.FloatTensor of size 8]
layer_efficient.norm_running_mean=
0.2840
-0.0010
-0.7056
0.3032
-1.2588
1.3351
-0.0184
0.0538
[torch.FloatTensor of size 8]

layer.norm.running_var =
0.7604
0.6536
1.3444
0.1388
1.1254
0.1573
1.3377
0.9247
[torch.FloatTensor of size 8]

layer_efficient.norm_running_var =
0.7321
0.6162
1.3844
0.0518
1.1621
0.0664
1.3355
0.9229
[torch.FloatTensor of size 8]

Softmax layer is missing in the code.

Hi,
I am currently trying to understand your DenseNet code. The DenseNet paper states:
At the end of the last dense block, a global average pooling is performed and then a softmax classifier is attached.
But I am unable to find a softmax layer in your optimised code. Could you kindly explain how this has been implemented in the densenet.py file?
Thanks
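
A common PyTorch pattern is for the model to return raw logits while the loss applies the softmax; for example, torch.nn.CrossEntropyLoss combines log-softmax with the negative log-likelihood. Whether this repository relies on that pattern is an assumption here. The plain-Python sketch below (function names are illustrative) shows that a cross-entropy computed straight from logits already contains the softmax:

```python
import math


def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_from_logits(logits, target):
    # Log-sum-exp form of -log(softmax(logits)[target]).
    shift = max(logits)
    log_sum_exp = shift + math.log(sum(math.exp(x - shift) for x in logits))
    return log_sum_exp - logits[target]


logits = [2.0, -1.0, 0.5]
explicit = -math.log(softmax(logits)[0])
implicit = cross_entropy_from_logits(logits, 0)
print(abs(explicit - implicit) < 1e-12)  # True
```

Because softmax is monotonic, the predicted class (the argmax) is the same whether it is taken over logits or over softmax probabilities.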

Any Pre-trained models on ImageNet

Are there pre-trained models ready to use?

Number of parameters doesn't match with naïve implementation

Hi @gpleiss,

I was trying to train an ensemble of DenseNets_BC_100_12 on 2 NVIDIA K80 GPUs when I ran into the memory-efficiency problem. However, my research is sensitive to the number of parameters, and when I moved to this implementation they no longer match.

In this implementation file you can see how the number of parameters exactly matches the ones reported:

    +-------------+-------------+-------+--------------+
    |    Model    | Growth Rate | Depth | M. of Params |
    +-------------+-------------+-------+--------------+
    |  DenseNet   |     12      |  40   |     1.02     |
    +-------------+-------------+-------+--------------+
    |  DenseNet   |     12      |  100  |     6.98     |
    +-------------+-------------+-------+--------------+
    |  DenseNet   |     24      |  100  |    27.249    |
    +-------------+-------------+-------+--------------+
    | DenseNet-BC |     12      |  100  |    0.769     |
    +-------------+-------------+-------+--------------+
    | DenseNet-BC |     24      |  250  |    15.324    |
    +-------------+-------------+-------+--------------+
    | DenseNet-BC |     40      |  190  |    25.624    |
    +-------------+-------------+-------+--------------+

However, in this other implementation, following your instructions:

    +-------------+-------------+-------+--------------+
    |    Model    | Growth Rate | Depth | M. of Params |
    +-------------+-------------+-------+--------------+
    | DenseNet-BC |     12      |  100  |    1.108     |
    +-------------+-------------+-------+--------------+
    | DenseNet-BC |     24      |  250  |    4.275     |
    +-------------+-------------+-------+--------------+
    | DenseNet-BC |     40      |  190  |     11.7     |
    +-------------+-------------+-------+--------------+

Is there something else that needs to be taken care of that I am not seeing?

Thanks a lot in advance,
Pablo

Is there any memory-efficient TensorFlow implementation?

Hi, I have run into the same GPU memory problem. Is there any memory-efficient TensorFlow implementation?
Thanks very much!

error rate compute in demo.py

Hi, thanks for this efficient densenet code.

But I found a probable mistake in demo.py at line 112:
error = 1 - torch.eq(predictions_var, target_var).float().mean()
it might have to be corrected to:
error = 1 - torch.eq(predictions_var.view(-1), target_var).float().mean()

Because the sizes of predictions_var and target_var are (train_size, 1) and (train_size, ), torch.eq(...) broadcasts and returns a train_size * train_size matrix whose entries are mostly 0. The error rate will then not be able to decrease.
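
The effect of that shape mismatch can be emulated without tensors. The sketch below is plain Python rather than PyTorch: comparing an (N, 1) column against an (N,) row under broadcasting amounts to comparing every prediction with every target, which is what `broadcast_error` does. Both function names are illustrative.

```python
def elementwise_error(predictions, targets):
    # Intended behaviour: compare prediction i with target i only.
    matches = [p == t for p, t in zip(predictions, targets)]
    return 1 - sum(matches) / len(matches)


def broadcast_error(predictions, targets):
    # Emulates an (N, 1) vs (N,) comparison: all N * N pairs are compared.
    matches = [p == t for p in predictions for t in targets]
    return 1 - sum(matches) / len(matches)


# 100 samples, 10 balanced classes, and a model that is always right.
targets = [i % 10 for i in range(100)]
predictions = list(targets)

print(elementwise_error(predictions, targets))          # 0.0
print(round(broadcast_error(predictions, targets), 3))  # 0.9
```

With 10 balanced classes, even perfect predictions score an error of about 0.9 under the all-pairs comparison, so the reported error stays near 0.9 regardless of training progress.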

New adaptive pooling layer.

It appears that models in the torchvision are now using adaptive poolings: adaptive_avg_pool2d
to break the tie to the input size: vision/densenet.py

Perhaps that would simplify the constructor a little and generalize its usage even more?

torch.utils.checkpoint costs much more memory than the previous 0.3 version

Hi,
I tested the 0.4 version. I found that torch.utils.checkpoint costs more memory than your PyTorch 0.3 implementation, which used torch._C._cudnn_batch_norm_forward.
But I cannot find a function similar to torch._C._cudnn_batch_norm_forward in 0.4.
Do you plan to implement it for PyTorch 0.4?

Compatibility with PyTorch 0.4

Thanks very much for sharing this implementation. I forked the code. It works great on PyTorch 0.3.1. But when I ran it with 0.4.0 (master version), I got following error (I made some minor change so the line number wouldn't match):
File "../networks/densenet_efficient.py", line 330, in forward

bn_input_var = Variable(type(inputs[0])(storage).resize_(size), volatile=True)

TypeError: Variable data has to be a tensor, but got torch.cuda.FloatStorage

It turned out that for this line:
bn_input_var = Variable(type(inputs[0])(storage).resize_(size), volatile=True)

The inputs in version 0.3.1 is FloatTensor but in 0.4.0 it's Variable.

I am wondering what's the best way to update the code for 0.4.0?

Many thanks!

Error in trying to use for the first time

Hello,

I am a beginner in python and pyTorch and am trying to use your densenet efficient implementation on a different dataset than CIFAR (images are 80 pixels wide, instead of 32). I use a windows 10 laptop with the experimental pyTorch port on Windows by peterjc123 (see pytorch/pytorch#494).

I have incorporated your DenseNetEfficient model in a training script adapted from andreasveit's densenet implementation for pyTorch and replaced the CIFAR datasets loaders with datasets ImageFolder as follows:
train_loader = torch.utils.data.DataLoader(
    # datasets.CIFAR10('../data', train=True, download=True,
    #                  transform=transform_train),
    datasets.ImageFolder(root=args.dataroot + '/train', transform=transform_train),
    batch_size=args.batch_size, shuffle=True, **kwargs)

When launching the training script; I get a cryptic error (for me):

Traceback (most recent call last):
  File "train.py", line 312, in <module>
    main()
  File "train.py", line 153, in main
    train(train_loader, model, criterion, optimizer, epoch)
  File "train.py", line 185, in train
    output = model(input_var)
  File "D:\deepLearning\Anaconda\lib\site-packages\torch\nn\modules\module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "D:\deepLearning\densenet\densenetEfficient.py", line 213, in forward
    out = self.classifier(out)
  File "D:\deepLearning\Anaconda\lib\site-packages\torch\nn\modules\module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "D:\deepLearning\Anaconda\lib\site-packages\torch\nn\modules\linear.py", line 54, in forward
    return self.backend.Linear.apply(input, self.weight, self.bias)
  File "D:\deepLearning\Anaconda\lib\site-packages\torch\nn\_functions\linear.py", line 12, in forward
    output.addmm(0, 1, input, weight.t())
RuntimeError: size mismatch at d:\downloads\pytorch-master-1\torch\lib\thc\generic/THCTensorMathBlas.cu:243

I am surely doing something wrong but searched a lot and did not find anything,

Any recommendation would be welcome,

Thanks a lot,

Christophe

Why is the error so high?

I ran the demo on CIFAR-10 and got the result shown below. The error is around 0.9 in all epochs, and the final error is very high (0.897). Is this right?

Eval: (Epoch 300 of 300) [0016/0020] Time: 0.06535 (1.054) Loss: 0.47998 (0.388) Error: 0.89749 (0.897)
Eval: (Epoch 300 of 300) [0017/0020] Time: 0.06668 (1.120) Loss: 0.34982 (0.385) Error: 0.89259 (0.897)
Eval: (Epoch 300 of 300) [0018/0020] Time: 0.06468 (1.185) Loss: 0.29784 (0.380) Error: 0.89301 (0.897)
Eval: (Epoch 300 of 300) [0019/0020] Time: 0.06477 (1.250) Loss: 0.52827 (0.388) Error: 0.89644 (0.897)
Eval: (Epoch 300 of 300) [0020/0020] Time: 0.03533 (1.285) Loss: 0.39355 (0.389) Error: 0.89657 (0.897)

Inference time issue

Hi,

The table shows a comparison of speed (sec/mini batch).

Is this speed the training time?

I wonder whether the inference speed of efficient one is also slower than the naive implementation.

Did you compare the inference time, too?

Efficient Conv3d Class

Hi @gpleiss , thanks for this efficient densenet code.
Would you kindly implement an 'Efficient Conv3d Class'?
I would really appreciate it if you could give me some guidance on implementing this class based on your 'EfficientConv2d' class.

Could you add a License?

Hi, I'm using this repo in my project which I plan to release soon, but it seems illegal to use your code without your permission. Could you add an open source license to this repo so that I can include your copyright in my project? It should take less than one minute to do so. Thanks.

Should there be a global average pooling layer before the classifier?

Hi, thank you for your great work! I only find a BatchNorm layer before the final classifier. Should there be a global average pooling layer before the classifier? Sorry if this is a simple question. Looking forward to your reply.

Pre-trained weight

Thank you for your effort. Do you know where to find pre-trained weights for the PyTorch implementation?

Validation dataset is being augmented as well

Thanks for the great repo! I just found out that the transformation is applied prior to the train/val split, effectively augmenting the validation set. I don't have a neat solution for this, but maybe this gist https://gist.github.com/kevinzakka/d33bf8d6c7f06a9d8c76d97a7879f5cb could be a way out, although it has to load the same data twice.
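
One way to keep augmentation out of the validation set is to split indices first and then apply them to two dataset objects built with different transforms. The sketch below covers only the index split with the standard library; `split_indices` is a hypothetical helper, and pairing it with torch.utils.data.Subset is a suggestion rather than what demo.py does.

```python
import random


def split_indices(num_samples, valid_size, seed=None):
    """Return (train_indices, valid_indices) as disjoint shuffled lists."""
    indices = list(range(num_samples))
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(indices)
    return indices[valid_size:], indices[:valid_size]


train_idx, valid_idx = split_indices(50000, 5000, seed=0)
print(len(train_idx), len(valid_idx))  # 45000 5000

# Possible follow-up in PyTorch (assumed usage, not from the repository):
#   train_set = Subset(dataset_with_train_transforms, train_idx)
#   valid_set = Subset(dataset_with_test_transforms, valid_idx)
```

This still constructs the underlying dataset twice, as the gist does, but the split itself is shared so the two subsets never overlap.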

Forward pass became slower after upgrading to 0.4

Hi,
Thanks for your work!
I recently upgraded my network to 0.4 with your implementation of DenseNet, and I found that the new version is slower than before. I thought that the shared memory would noticeably speed up the forward pass. In my application, predicting one subject cost 9s on the 0.3.x version, but now it needs 11s.

The Dice metric also got worse than before. I found that the new code uses Kaiming normal initialization, whereas the old code used the default initialization (uniform?). I tried to set all parameters as before, but it had no effect. Do you have any advice for me?

Thanks.

What is the minimum GPU memory required? Still breaks for me in a single GPU

Amazon p3.2xlarge: 1 GPUs - Tesla V100 -- GPU Memory: 16GB -- Batch Size = 64
If efficient = False:
Error: RuntimeError: CUDA out of memory. Tried to allocate 1024.00 KiB (GPU 0; 15.75 GiB total capacity; 14.71 GiB already allocated; 4.88 MiB free; 4.02 MiB cached)

If efficient = True:
Error: RuntimeError: CUDA out of memory. Tried to allocate 61.25 MiB (GPU 0; 15.75 GiB total capacity; 14.65 GiB already allocated; 50.88 MiB free; 5.33 MiB cached)

Amazon g3.4xlarge: 1 GPUs - Tesla M60 -- GPU Memory: 8GB -- Batch Size = 64

If efficient = False:
RuntimeError: CUDA out of memory. Tried to allocate 184.00 MiB (GPU 0; 7.44 GiB total capacity; 6.98 GiB already allocated; 25.81 MiB free; 5.57 MiB cached)

If efficient = True:
RuntimeError: CUDA out of memory. Tried to allocate 184.00 MiB (GPU 0; 7.44 GiB total capacity; 6.98 GiB already allocated; 25.81 MiB free; 5.57 MiB cached)

`DenseNetEfficientMulti` not giving the same forward result as `DenseNetEfficient`

I tried to train on multiple GPUs, and after many attempts I found that DenseNetEfficientMulti does not give the same output as DenseNetEfficient. In fact, DenseNetEfficientMulti's output depends on how many device_ids are specified in DataParallel. However, when only one GPU is specified, it does give the same result as DenseNetEfficient. And when the selected GPUs are fixed, DenseNetEfficientMulti's outputs are also fixed.

If I just use DenseNetEfficient for the multi-GPU case, it says that the tensors are on different GPUs. I guess some buffer causes this issue.

@taineleau @gpleiss would you either make DenseNetEfficient support the multi-GPU case, or fix the bug in DenseNetEfficientMulti? I think DenseNetEfficientMulti might have some bugs, as its output depends on the number of GPUs used. Any insights to solve this will be helpful, thanks.

Pretrained models

Any plans to make pretrained models available?
Or is it possible to use the pretrained models from https://github.com/liuzhuang13/DenseNet#results-on-imagenet-and-pretrained-models?

Thanks in advance,

Could it be MORE memory efficient?

Hi,

Thanks for your code. I read both the single-gpu and multi-gpus codes. For the single-gpu version, you create the shared memory inside each dense block. Could all the dense blocks share the same memory and you only allocate one block of space? I think it should further reduce the space usage.

For the multi-GPU version, you create the shared memory in the initialization method of the whole network, i.e. one level above the dense block initialization. However, you register a buffer inside each dense block for the shared memory, which is done by
self.register_buffer('CatBN_output_buffer', self.storage)
Does this mean each dense block has independent shared memory? If so, why don't you let them share the same area?

Thanks

Input data size

Hi, thanks for this efficient DenseNet code.
I have a problem.
I am trying to train the model with different input sizes.
If the input is smaller than 64 * 64 there is no problem,
but if it is larger than 64 * 64 the following error appears:
RuntimeError: size mismatch at /pytorch/torch/lib/THC/generic/THCTensorMathBlas.cu:247
Can you point me to how to fix it?
Thank you so much.
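
A size mismatch in a matrix multiply often means the classifier's linear layer received more features than it was built for. One possible cause, assuming the network ends with a fixed-size average pool followed by a linear layer, is that larger inputs leave a feature map bigger than 1 x 1 before flattening. The sketch below uses a hypothetical stand-in model (TinyHead is not part of this repository) to show how adaptive average pooling keeps the classifier input size independent of the image size:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyHead(nn.Module):
    """Hypothetical stand-in for a CNN with a linear classifier on top."""

    def __init__(self, channels=16, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        self.classifier = nn.Linear(channels, num_classes)

    def forward(self, x):
        out = self.features(x)
        # Pool every channel down to 1 x 1, whatever the spatial size is,
        # so the flattened vector always has `channels` entries.
        out = F.adaptive_avg_pool2d(out, (1, 1))
        out = torch.flatten(out, 1)
        return self.classifier(out)


model = TinyHead()
small = model(torch.randn(2, 3, 32, 32))    # small input
large = model(torch.randn(2, 3, 128, 128))  # larger input, same model
```

Whether this applies here depends on how the pooling is written in the model version you are running, so treat it as a starting point for debugging rather than a confirmed fix.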

Does this code support PyTorch 1.0 and the JIT feature for C++ online deployment?

Segmentation fault (core dumped) error for multiple GPUs

Environment:

Python: 3.6
PyTorch: 0.4.0
OS: Ubuntu 18.04.1 LTS
CUDA: V9.1.85
GPU: Tesla K80
Problem:
I was running a model that does not need BatchNorm, so I changed the original DenseNet a little bit.
Here is the code snippet:

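import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.checkpoint as cp


def _cat_function_factory(conv, relu):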
    def cat_function(*inputs):
        concated_features = torch.cat(inputs, 1)
        bottleneck_output = relu(conv(concated_features))
        return bottleneck_output
    return cat_function


class _DenseLayer(nn.Module):
    def __init__(self, num_input_features, growth_rate, bn_size, drop_rate):
        super(_DenseLayer, self).__init__()
        self.add_module('conv1', nn.Conv2d(num_input_features, bn_size * growth_rate, 1))
        self.add_module('relu1', nn.ReLU(inplace=True))
        self.add_module('conv2', nn.Conv2d(bn_size * growth_rate, growth_rate, 3, padding=1))
        self.add_module('relu2', nn.ReLU(inplace=True))
        self.drop_rate = drop_rate

    def forward(self, *inputs):
        cat_function = _cat_function_factory(self.conv1, self.relu1)
        if any(feature.requires_grad for feature in inputs):
            output = cp.checkpoint(cat_function, *inputs)
        else:
            output = cat_function(*inputs)
        new_features = self.relu2(self.conv2(output))
        if self.drop_rate > 0:
            new_features = F.dropout(new_features, p=self.drop_rate, training=self.training)
        return new_features


class _DenseBlock(nn.Module):
    def __init__(self, num_layers, num_input_features, bn_size, growth_rate, drop_rate):
        super(_DenseBlock, self).__init__()
        for i in range(num_layers):
            layer = _DenseLayer(num_input_features + i * growth_rate,
                                growth_rate, bn_size, drop_rate)
            self.add_module(f'denselayer{i + 1}', layer)

    def forward(self, init_features):
        features = [init_features]
        for name, layer in self.named_children():
            new_features = layer(*features)
            features.append(new_features)
        return torch.cat(features, 1)

It runs on a single GPU, but it throws a Segmentation fault (core dumped) error when running on multiple GPUs. What could be causing this issue?

use example

Hi,

Would it be possible to also have a test example in the demo? (not only train)

When I try to use the network I trained, the following command:
outputs = model(torch.autograd.Variable(images))

Throws this kind of error message:

File "evaluate.py", line 65, in main
outputs = model(torch.autograd.Variable(images))
File "D:\deepLearning\Anaconda\lib\site-packages\torch\nn\modules\module.py", line 206, in __call__
result = self.forward(*input, **kwargs)
File "D:\deepLearning\densenet\densenetEfficient.py", line 205, in forward
features = self.features(x)
File "D:\deepLearning\Anaconda\lib\site-packages\torch\nn\modules\module.py", line 206, in __call__
result = self.forward(*input, **kwargs)
File "D:\deepLearning\Anaconda\lib\site-packages\torch\nn\modules\container.py", line 64, in forward
input = module(input)
File "D:\deepLearning\Anaconda\lib\site-packages\torch\nn\modules\module.py", line 206, in __call__
result = self.forward(*input, **kwargs)
File "D:\deepLearning\Anaconda\lib\site-packages\torch\nn\modules\conv.py", line 237, in forward
self.padding, self.dilation, self.groups)
File "D:\deepLearning\Anaconda\lib\site-packages\torch\nn\functional.py", line 43, in conv2d
return f(input, weight, bias)
RuntimeError: expected CPU tensor (got CUDA tensor)

Thanks in advance,

Christophe
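
An error of this kind usually means the model's weights and the input batch live on different devices. A minimal evaluation sketch, assuming a recent PyTorch (0.4 or later, where Variable is no longer needed) and using a small hypothetical stand-in for the trained network:

```python
import torch
import torch.nn as nn

# Pick the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-in for the trained DenseNet; load your own model here.
model = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU())
model = model.to(device)  # move the weights to the chosen device
model.eval()              # use inference behaviour for BatchNorm and dropout

images = torch.randn(2, 3, 32, 32)  # placeholder batch
with torch.no_grad():               # no gradients are needed for evaluation
    outputs = model(images.to(device))  # input goes to the same device
```

The key point is that model.to(device) and images.to(device) use the same device, so the convolution never sees a mix of CPU and CUDA tensors.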

pretrained densenet169 weights

Hi, thanks for your great work!
I'm working on densenet169 these days, do you know where I can find the ImageNet pretrained weights for this efficient implementation? Or do you have any example code to show how to convert the other implementation's pretrained model to this one?
I did notice #13, but it seems @ZhengRui didn't provide any example code, and I don't know where to start.
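
Converting weights between two implementations of the same architecture usually comes down to renaming state_dict keys. The sketch below is only an illustration under an assumed naming difference (source keys such as "norm.1.weight" versus target keys such as "norm1.weight"); the real key names in either checkpoint need to be inspected first, and the two small modules are hypothetical stand-ins rather than actual DenseNets:

```python
import re
from collections import OrderedDict

import torch
import torch.nn as nn

# Assumed difference: source uses "norm.1.", target uses "norm1." (same for relu/conv).
_PATTERN = re.compile(r"\b(norm|relu|conv)\.(\d)\.")


def remap_key(key):
    """Rename one parameter key from the assumed source style to the target style."""
    return _PATTERN.sub(r"\1\2.", key)


def convert_state_dict(state_dict):
    """Return a copy of state_dict with every key renamed by remap_key."""
    return OrderedDict((remap_key(k), v) for k, v in state_dict.items())


# Hypothetical stand-ins for the two implementations.
source = nn.Sequential(OrderedDict([
    ("norm", nn.Sequential(nn.ReLU(), nn.BatchNorm2d(4))),  # keys: norm.1.*
]))
target = nn.Sequential(OrderedDict([
    ("norm1", nn.BatchNorm2d(4)),                           # keys: norm1.*
]))

with torch.no_grad():
    source.norm[1].weight.fill_(0.5)  # make the copied values recognisable

converted = convert_state_dict(source.state_dict())
target.load_state_dict(converted)  # strict by default: fails loudly on a mismatch
```

Keeping load_state_dict strict is deliberate: any key that the renaming rule misses shows up as an error instead of being silently skipped.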

The efficient implementation is not compatible with parallel computing

Hi, I ran the demo in this repo without changing anything except the CIFAR data directory. However, it raised a RuntimeError: tensors are on different GPUs. Then I tried the naive implementation with the command line flag --efficient=False, and it worked fine. By the way, the efficient implementation works well after I modify the code to not use torch.nn.DataParallel. Did you actually test your implementation on multiple GPUs? I guess that since you manually change the behaviour of the gradient flow, something goes wrong...

Eliminate reduce() ?

The built-in reduce() function was removed in Python 3. https://docs.python.org/3.0/whatsnew/3.0.html#builtins There is still an implementation in functools, but the advice is to use a loop instead. How should https://github.com/gpleiss/efficient_densenet_pytorch/blob/master/models/densenet_efficient.py#L146 be modified to work in both Python 2 and Python 3?
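
Both options mentioned above can be written so that they run unchanged on Python 2.6+ and Python 3. The sketch assumes the call in question folds a sequence of dimensions into a single product; that is a guess about the linked line, and the function names are made up for illustration:

```python
from functools import reduce  # importable on Python 2.6+ and Python 3
import operator


def num_elements(size):
    """Product of all dimensions, using functools.reduce."""
    return reduce(operator.mul, size, 1)


def num_elements_loop(size):
    """Same product written as the explicit loop the Python docs suggest."""
    total = 1
    for dim in size:
        total *= dim
    return total


size = (64, 48, 32, 32)  # example tensor shape
```

Either form works on both interpreters; the import is the smaller change, while the loop avoids reduce() entirely.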

Unable to run demo.

I tried to run the demo.py file, and got the error:
('The function received no value for the required argument:', 'data')

MultiGPU efficient densenets are slow

I want to benchmark the new implementation of the efficient DenseNet with the code here. However, it seems that the checkpointed modules are not broadcast to multiple GPUs, as I got the following error:

  File "/home/changmao/efficient_densenet_pytorch/models/densenet.py", line 16, in bn_function
    bottleneck_output = conv(relu(norm(concated_features)))
  File "/home/changmao/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/changmao/anaconda3/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 49, in forward
    self.training or not self.track_running_stats, self.momentum, self.eps)
  File "/home/changmao/anaconda3/lib/python3.6/site-packages/torch/nn/functional.py", line 1194, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_batch_norm)

I think that the checkpoint feature provides weak support for nn.DataParallel.

Efficient seems not so efficient.

Hi @gpleiss,
I really appreciate your great work. But when I try your code on 448*448 images, a network of depth 40 does not work with a batch size of 10, so a network of depth 101 cannot be trained either. My GPU has 11 GB of memory.
Can you help me?