
cunn's Introduction

# CUDA backend for the Neural Network Package #

This package provides a CUDA implementation for many of the modules in the base nn package.

  • Modules: There are also additional GPU-related modules not found in the nn package.

Installing from source

git clone https://github.com/torch/cunn
cd cunn
luarocks make rocks/cunn-scm-1.rockspec

To use

Simply convert your network model to CUDA by calling :cuda():

local model = nn.Sequential()
model:add(nn.Linear(2,2))
model:add(nn.LogSoftMax())

model:cuda()  -- convert model to CUDA

... and similarly for your tensors:

local input = torch.Tensor(32,2):uniform()
input = input:cuda()
local output = model:forward(input)

... or create them directly as CudaTensors:

local input = torch.CudaTensor(32,2):uniform()
local output = model:forward(input)
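
During training, the criterion and target tensors need the same treatment. A minimal sketch, continuing the model and input from above (the choice of ClassNLLCriterion and the target values here are illustrative):

local criterion = nn.ClassNLLCriterion():cuda()  -- convert the criterion too
local target = torch.CudaTensor(32):fill(1)      -- targets must also be CudaTensors
local loss = criterion:forward(model:forward(input), target)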

To run unit-tests

luajit -l cunn -e 'cunn.test()'

GPU Training Concepts

Performance

  • data should be transferred between main memory and the GPU in batches; otherwise the transfer time will be dominated by fixed latency (ultimately bounded by the speed of light) and per-call overheads, rather than by bandwidth
  • therefore, train and predict using mini-batches
  • allocating GPU memory causes a sync-point, which will noticeably affect performance
    • therefore try to allocate any CudaTensors once, at the start of the program, and then simply copy data backwards and forwards between main memory and existing CudaTensors
  • similarly, try to avoid any operations that implicitly allocate new tensors. For example, if you write:
require 'cutorch'

local a = torch.CudaTensor(1000):uniform()
for it=1,1000 do
  local b = torch.add(a, 1)
end

... this will allocate one thousand new CudaTensors, one for each call to torch.add(a, 1).

Use instead this form:

require 'cutorch'

local a = torch.CudaTensor(1000):uniform()
local b = torch.CudaTensor(1000):uniform()
for it=1,1000 do
  b:add(a, 1)
end

In this form, b is allocated only once, before the loop. The b:add(a, 1) operation then performs the addition inside a GPU kernel and stores the result in the existing b CudaTensor. This generally runs noticeably faster, is far less likely to eat up arbitrary amounts of memory, and reduces the need for frequent collectgarbage(); collectgarbage() calls.
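
Putting the earlier bullets together, here is a minimal sketch of the recommended transfer pattern (the sizes and iteration counts are illustrative):

require 'cutorch'

-- allocate the device-side buffer once, up front
local gpuInput = torch.CudaTensor(128, 2)

for batch = 1, 100 do
  -- stage one mini-batch in main memory
  local cpuInput = torch.FloatTensor(128, 2):uniform()
  gpuInput:copy(cpuInput)  -- reuse the existing buffer: no new allocation
  -- ... run forward/backward on gpuInput ...
end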

Benchmarking

  • GPU operations typically execute asynchronously: a call returns once the work has been queued, not once it has completed
  • eg, if you do:
require 'cutorch'
local a = torch.CudaTensor(1000,1000):uniform()
a:add(1)

... the a:add(1) call only schedules the GPU kernel for launch. The kernel might not have completed, or even reached the GPU, by the time a:add(1) returns

  • therefore, for wall-clock timings, you should call cutorch.synchronize() before each time-check point:
require 'cutorch'
require 'sys'

local a = torch.CudaTensor(1000,1000):uniform()
cutorch.synchronize()
start = sys.tic()
a:add(1)
cutorch.synchronize()
print(sys.toc())
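
A small helper along these lines keeps benchmarks honest by synchronizing on both sides of the measured work (a sketch using the same sys timer; gpuTime is not part of cutorch):

require 'cutorch'
require 'sys'

-- time a function's GPU work, synchronizing before and after (sketch)
local function gpuTime(f)
  cutorch.synchronize()
  sys.tic()
  f()
  cutorch.synchronize()
  return sys.toc()
end

local a = torch.CudaTensor(1000,1000):uniform()
print(gpuTime(function() a:add(1) end))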

cunn's People

Contributors

adamlerer, ajtulloch, andreaskoepf, andresy, apaszke, borisfom, btnc, clementfarabet, colesbury, csarofeen, dominikgrewe, fbesse, fmassa, gchanan, hughperkins, huihuifan, ipanchenko, jhjin, jnhwkim, jonathantompson, kmul00, leonbottou, mikepound, mys007, nicholas-leonard, pavanky, samehkhamis, soumith, szagoruyko, wickedfoo


cunn's Issues

can't compile

/home/nicholas14/cunn/ClassNLLCriterion.cu(11): error: identifier "assert" is undefined

Inconsistent behaviour of ClassNLLCriterion wrt nn

Following https://groups.google.com/forum/#!topic/torch7/cw2hetc_YjQ , there seems to be an inconsistency between the nn and cunn versions of ClassNLLCriterion when the target has 2 dimensions (or more).

nn simply throws an error, while cunn doesn't. Reading the source code of the CUDA version, I have the impression that if the target is 2D, it must be all zeros except at the corresponding target position, which should hold the class index. It's a bit confusing to explain; maybe an example is easier:

require 'cunn'

m = nn.ClassNLLCriterion()
a = torch.rand(2,3)
t1 = torch.Tensor({2,1})
t2 = torch.zeros(2,3); t2[1][2] = 2; t2[2][1] = 1;
t3 = torch.zeros(2,3); t3[1][1] = 2; t3[2][1] = 1;
-- t2 and t3 are similar to a one-hot encoding, but each
-- non-zero element should hold the proper target index,
-- in whatever position it wants
print(m:forward(a,t1)) 
print(m:forward(a,t2)) -- error in nn

-- cuda
m:cuda()
a = a:cuda(); t1 = t1:cuda(); t2= t2:cuda(); t3 = t3:cuda()
print(m:forward(a,t1)) 
print(m:forward(a,t2)) -- works in cunn, exactly as the previous
print(m:forward(a,t3)) -- works in cunn, exactly as the previous

They all produce the same result, whenever they work.

Is there a reason/logic behind this implementation? Is it to support multi-target multi-class learning?
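
If the goal is simply consistent behaviour across backends, one hedged workaround is to collapse the 2D index-coded target back into the 1D form that both versions accept. This sketch assumes each target row holds exactly one positive entry equal to the class index, as in the example above:

-- collapse the 2D index-coded target to 1D; for t2 above this yields {2, 1},
-- identical to t1
local t1d = t2:max(2):view(-1)
print(m:forward(a, t1d))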

Dependency Error on Installation

Fails when installing cunn
On: OSX 10.10.5, cmake 3.3.2, CUDA 7.5.21, fresh build of Torch7

Output:
luarocks install cunn
Installing https://raw.githubusercontent.com/torch/rocks/master/cunn-scm-1.rockspec...
Using https://raw.githubusercontent.com/torch/rocks/master/cunn-scm-1.rockspec... switching to 'build' mode

Missing dependencies for cunn:
cutorch >= 1.0

Using https://raw.githubusercontent.com/torch/rocks/master/cutorch-scm-1.rockspec... switching to 'build' mode
Cloning into 'cutorch'...
remote: Counting objects: 82, done.
remote: Compressing objects: 100% (79/79), done.
remote: Total 82 (delta 7), reused 26 (delta 0), pack-reused 0
Receiving objects: 100% (82/82), 126.43 KiB, done.
Resolving deltas: 100% (7/7), done.
cmake -E make_directory build && cd build && cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_PREFIX_PATH="/Users/jrbaldwin/torch/install/bin/.." -DCMAKE_INSTALL_PREFIX="/Users/jrbaldwin/torch/install/lib/luarocks/rocks/cutorch/scm-1" && make -j$(getconf _NPROCESSORS_ONLN) install

-- The C compiler identification is AppleClang 6.1.0.6020049
-- The CXX compiler identification is AppleClang 6.1.0.6020049
-- Check for working C compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc
-- Check for working C compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++
-- Check for working CXX compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Torch7 in /Users/jrbaldwin/torch/install
-- Looking for include file pthread.h
-- Looking for include file pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - found
-- Found Threads: TRUE
-- Found CUDA: /usr/local/cuda (found suitable version "7.5", minimum required is "5.5")
-- Compiling for CUDA architecture: 3.0
-- Configuring done
-- Generating done
-- Build files have been written to: /tmp/luarocks_cutorch-scm-1-4865/cutorch/build
[ 6%] Building NVCC (Device) object lib/THC/CMakeFiles/THC.dir/THC_generated_THCReduceApplyUtils.cu.o
[ 6%] Building NVCC (Device) object lib/THC/CMakeFiles/THC.dir/THC_generated_THCTensorSort.cu.o
[ 6%] Building NVCC (Device) object lib/THC/CMakeFiles/THC.dir/THC_generated_THCBlas.cu.o
[ 8%] Generating TensorMath.c
[ 10%] Building NVCC (Device) object lib/THC/CMakeFiles/THC.dir/THC_generated_THCStorageCopy.cu.o
[ 13%] Building NVCC (Device) object lib/THC/CMakeFiles/THC.dir/THC_generated_THCStorage.cu.o
[ 15%] Building NVCC (Device) object lib/THC/CMakeFiles/THC.dir/THC_generated_THCTensor.cu.o
[ 17%] Building NVCC (Device) object lib/THC/CMakeFiles/THC.dir/THC_generated_THCTensorCopy.cu.o
In file included from :326:
In file included from :13:
In file included from /usr/local/cuda/include/cuda_runtime.h:112:
/usr/local/cuda/include/common_functions.h:65:10: fatal error: 'string.h' file not found
In file included from :326:
In file included from :13:
In file included from /usr/local/cuda/include/cuda_runtime.h:112:
/usr/local/cuda/include/common_functions.h:65:10: fatal error: 'string.h' file not found
In file included from :326:
In file included from :13:
In file included from /usr/local/cuda/include/cuda_runtime.h:112:
/usr/local/cuda/include/common_functions.h:65:10: fatal error: 'string.h' file not found
In file included from :326:
In file included from :13:
In file included from /usr/local/cuda/include/cuda_runtime.h:112:
/usr/local/cuda/include/common_functions.h:65:10: fatal error: 'string.h' file not found
In file included from :326:
In file included from :13:
In file included from /usr/local/cuda/include/cuda_runtime.h:112:
/usr/local/cuda/include/common_functions.h:65:10: fatal error: 'string.h' file not found
In file included from :326:
In file included from :13:
In file included from /usr/local/cuda/include/cuda_runtime.h:112:
/usr/local/cuda/include/common_functions.h:65:10: fatal error: 'string.h' file not found
In file included from :326:
In file included from :13:
In file included from /usr/local/cuda/include/cuda_runtime.h:112:
/usr/local/cuda/include/common_functions.h:65:10: fatal error: 'string.h' file not found

include <string.h>

     ^

Scanning dependencies of target cutorch_static
1 error generated.
1 error generated.
1 error generated.
1 error generated.
CMake Error at THC_generated_THCTensor.cu.o.cmake:207 (message):
Error generating
/tmp/luarocks_cutorch-scm-1-4865/cutorch/build/lib/THC/CMakeFiles/THC.dir//./THC_generated_THCTensor.cu.o

CMake Error at THC_generated_THCStorageCopy.cu.o.cmake:207 (message):
Error generating
/tmp/luarocks_cutorch-scm-1-4865/cutorch/build/lib/THC/CMakeFiles/THC.dir//./THC_generated_THCStorageCopy.cu.o

CMake Error at THC_generated_THCBlas.cu.o.cmake:207 (message):
Error generating
/tmp/luarocks_cutorch-scm-1-4865/cutorch/build/lib/THC/CMakeFiles/THC.dir//./THC_generated_THCBlas.cu.o

make[2]: *** [lib/THC/CMakeFiles/THC.dir/THC_generated_THCTensor.cu.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[2]: *** [lib/THC/CMakeFiles/THC.dir/THC_generated_THCBlas.cu.o] Error 1
make[2]: *** [lib/THC/CMakeFiles/THC.dir/THC_generated_THCStorageCopy.cu.o] Error 1
CMake Error at THC_generated_THCTensorCopy.cu.o.cmake:207 (message):
Error generating
/tmp/luarocks_cutorch-scm-1-4865/cutorch/build/lib/THC/CMakeFiles/THC.dir//./THC_generated_THCTensorCopy.cu.o

make[2]: *** [lib/THC/CMakeFiles/THC.dir/THC_generated_THCTensorCopy.cu.o] Error 1
1 error generated.
CMake Error at THC_generated_THCReduceApplyUtils.cu.o.cmake:207 (message):
Error generating
/tmp/luarocks_cutorch-scm-1-4865/cutorch/build/lib/THC/CMakeFiles/THC.dir//./THC_generated_THCReduceApplyUtils.cu.o

make[2]: *** [lib/THC/CMakeFiles/THC.dir/THC_generated_THCReduceApplyUtils.cu.o] Error 1
[ 19%] Building C object CMakeFiles/cutorch_static.dir/Storage.c.o
[ 21%] Building C object CMakeFiles/cutorch_static.dir/init.c.o
[ 26%] Building C object CMakeFiles/cutorch_static.dir/Tensor.c.o
[ 26%] Building C object CMakeFiles/cutorch_static.dir/TensorMath.c.o
[ 30%] Building C object CMakeFiles/cutorch_static.dir/torch/utils.c.o
[ 30%] Building C object CMakeFiles/cutorch_static.dir/TensorOperator.c.o
[ 32%] Linking C static library libcutorch.a
[ 32%] Built target cutorch_static
1 error generated.
CMake Error at THC_generated_THCTensorSort.cu.o.cmake:207 (message):
Error generating
/tmp/luarocks_cutorch-scm-1-4865/cutorch/build/lib/THC/CMakeFiles/THC.dir//./THC_generated_THCTensorSort.cu.o

make[2]: *** [lib/THC/CMakeFiles/THC.dir/THC_generated_THCTensorSort.cu.o] Error 1
1 error generated.
CMake Error at THC_generated_THCStorage.cu.o.cmake:207 (message):
Error generating
/tmp/luarocks_cutorch-scm-1-4865/cutorch/build/lib/THC/CMakeFiles/THC.dir//./THC_generated_THCStorage.cu.o

make[2]: *** [lib/THC/CMakeFiles/THC.dir/THC_generated_THCStorage.cu.o] Error 1
make[1]: *** [lib/THC/CMakeFiles/THC.dir/all] Error 2
make: *** [all] Error 2

Error: Failed installing dependency: https://raw.githubusercontent.com/torch/rocks/master/cutorch-scm-1.rockspec - Build error: Failed building.

gcc -v:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.sdk/usr/include/c++/4.2.1
Apple LLVM version 6.1.0 (clang-602.0.49) (based on LLVM 3.6.0svn)
Target: x86_64-apple-darwin14.5.0
Thread model: posix

Can't install cunn with luarocks because of missing file

[joonazan@arkkikaari char-rnn]$ luarocks install cunn
Installing https://raw.githubusercontent.com/torch/rocks/master/cunn-scm-1.rockspec...
Using https://raw.githubusercontent.com/torch/rocks/master/cunn-scm-1.rockspec... switching to 'build' mode
Cloning into 'cunn'...
remote: Counting objects: 49, done.
remote: Compressing objects: 100% (38/38), done.
remote: Total 49 (delta 15), reused 21 (delta 7), pack-reused 0
Receiving objects: 100% (49/49), 56.13 KiB | 0 bytes/s, done.
Resolving deltas: 100% (15/15), done.
Checking connectivity... done.
cmake -E make_directory build && cd build && cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_PREFIX_PATH="/home/joonazan/torch/install/bin/.." -DCMAKE_INSTALL_PREFIX="/home/joonazan/torch/install/lib/luarocks/rocks/cunn/scm-1" && make -j$(getconf _NPROCESSORS_ONLN) install

-- The C compiler identification is GNU 5.1.0
-- The CXX compiler identification is GNU 5.1.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Torch7 in /home/joonazan/torch/install
-- Found CUDA: /opt/cuda (found suitable version "7.0", minimum required is "4.0") 
-- Configuring done
-- Generating done
-- Build files have been written to: /tmp/luarocks_cunn-scm-1-2819/cunn/build
[100%] Building NVCC (Device) object CMakeFiles/cunn.dir/cunn_generated_init.cu.o
In file included from /home/joonazan/torch/install/include/THC/THCApply.cuh:5:0,
                 from /tmp/luarocks_cunn-scm-1-2819/cunn/HardTanh.cu:2,
                 from /tmp/luarocks_cunn-scm-1-2819/cunn/init.cu:14:
/home/joonazan/torch/install/include/THC/THCReduceApplyUtils.cuh:9:30: fatal error: THCDeviceUtils.cuh: No such file or directory
compilation terminated.
CMake Error at cunn_generated_init.cu.o.cmake:206 (message):
  Error generating
  /tmp/luarocks_cunn-scm-1-2819/cunn/build/CMakeFiles/cunn.dir//./cunn_generated_init.cu.o


CMakeFiles/cunn.dir/build.make:55: recipe for target 'CMakeFiles/cunn.dir/cunn_generated_init.cu.o' failed
make[2]: *** [CMakeFiles/cunn.dir/cunn_generated_init.cu.o] Error 1
CMakeFiles/Makefile2:60: recipe for target 'CMakeFiles/cunn.dir/all' failed
make[1]: *** [CMakeFiles/cunn.dir/all] Error 2
Makefile:116: recipe for target 'all' failed
make: *** [all] Error 2

Error: Build error: Failed building.

Limit on input Tensor size using cunn

Code to reproduce the error:

require 'cunn'
require 'cutorch'
cutorch.setDevice(1)
model=nn.Sequential():add(nn.Linear(300, 500)):add(nn.LogSoftMax()):cuda()
batch_size=90000
output=model:forward(torch.rand(batch_size,300):float():cuda())

The error:

/home/tushar/torch/install/share/lua/5.1/nn/Sequential.lua:44: invalid argument at /tmp/luarocks_cunn-scm-1-144/cunn/LogSoftMax.cu:249
stack traceback:
    [C]: in function 'updateOutput'
    /home/tushar/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    [string "_RESULT={model:forward(torch.rand(batch_size,..."]:1: in main chunk
    [C]: in function 'xpcall'
    /home/tushar/torch/install/share/lua/5.1/trepl/init.lua:630: in function 'repl'
    ...shar/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:185: in main chunk
    [C]: at 0x00406670  
                                                                      [0.2761s]

Large batch sizes work fine on CPU. Smaller batch sizes (<80k) work fine on GPU.

soumith chintala (smth) on the issue:

It must be that the cuda launch parameters (number of blocks/threads) were configured without such large batch sizes in mind

Need help calculating convolution output size with stride

The convolution network modules part:

3 :
{
padding : 3
kW : 7
nInputPlane : 8
gradBias : CudaTensor - size: 8
dW : 1
gradWeight : CudaTensor - size: 8x392
output : CudaTensor - size: 1x8x61x61
fgradInput : CudaTensor - size: 61x61
finput : CudaTensor - size: 392x3721
bias : CudaTensor - size: 8
weight : CudaTensor - size: 8x392
nOutputPlane : 8
gradInput : CudaTensor - empty
kH : 7
dH : 1
}
4 :
{
padding : 2
kW : 7
nInputPlane : 8
gradBias : CudaTensor - size: 12
dW : 1
gradWeight : CudaTensor - size: 12x392
output : CudaTensor - size: 1x12x59x59
fgradInput : CudaTensor - size: 59x59
finput : CudaTensor - size: 392x3481
bias : CudaTensor - size: 12
weight : CudaTensor - size: 12x392
nOutputPlane : 12
gradInput : CudaTensor - empty
kH : 7
dH : 1

}

As listed above, the output size of layer 3 is 8x61x61.

Layer 4 is defined as nn.SpatialConvolutionMM(8,12,7,7,1,1,3),

from which the expected output size is

(61+3-7)/1+1 = 58,

however, the module info shows that the output is 12x59x59.

Can anyone help me figure this out?

thx~
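
For reference, SpatialConvolutionMM pads both sides of each spatial dimension, so the usual formula is floor((iW + 2*pad - kW)/dW) + 1; the 58 above counts the padding only once. A quick worked check in Lua, using the padding of 2 reported in the module dump (rather than the 3 passed to the constructor):

local iW, kW, dW, pad = 61, 7, 1, 2
local oW = math.floor((iW + 2*pad - kW) / dW) + 1
print(oW)  -- 59, matching the reported 1x12x59x59 output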

No MultiLabelMarginCriterion

Hi guys,
I noticed there is no cunn implementation of this criterion.
Do you plan to implement that any time soon?
We could try implementing it ourselves otherwise or just keep the criterion on CPU.
Best,
D.

Volumetric Convolutions

@nouiz pointed out that they have excellent Volumetric convolutions in theano, recently implemented by @stencilman and others.
The code is Caffe-style, 500 lines of portability.
https://github.com/ballasn/Theano/blob/Corr3DMM/theano/sandbox/cuda/corr3d_gemm.cu

We can get these into cunn as well.

If anyone wants to put their hands on this, ping here and claim the task. (You can pretty much do copy-pasta programming here, starting with SpatialConvolutionMM.) The work involved is about 1-2 hours.

If no one does it, I will get around to this probably this weekend (or when I find time).

bug at require 'cunn'

A month ago the same code ran perfectly fine, but today when I checked again I got this error. Any ideas or hints?

==> switching to CUDA   
/usr/local/bin/luajit: unable to initialize cublas
stack traceback:
        [C]: at 0x7f19499a5f50
        [C]: in function 'require'
        /usr/local/share/lua/5.1/cutorch/init.lua:2: in main chunk
        [C]: in function 'require'
        /usr/local/share/lua/5.1/cunn/init.lua:1: in main chunk
        [C]: in function 'require'
        cifardoall.lua:60: in main chunk
        [C]: in function 'dofile'
        /usr/local/lib/luarocks/rocks/trepl/scm-1/bin/th:109: in main chunk
        [C]: at 0x00404480

When memory pressure is high, the THCStorage.cu resize() algorithm uses a device-to-device copy, which will cause an out-of-memory crash

The code logic is roughly:

float *data;
THCudaCheck(cudaMalloc((void**)(&data), size * sizeof(float)));
THCudaCheck(cudaMemcpyAsync(data, self->data, THMin(self->size, size) * sizeof(float), cudaMemcpyDeviceToDevice));
THCudaCheck(cudaFree(self->data));

Consider this scenario: the GPU has 4 GB of RAM, 3 GB of which is already in use. If Lua calls resize() to grow the storage from 3 GB to 3.5 GB, the code above will crash with out-of-memory because the old and new buffers must coexist, even though the new buffer alone would fit in the GPU's free RAM.

The optimized logic should be: if device RAM is not enough for the malloc, first copy the current data from device to host, then release the device RAM; after that, malloc the new device buffer, and finally copy the content back from host to device.

Please consider this request's importance, because device RAM is always very tight. It is better to release ahead of malloc.
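
A hypothetical Lua-level workaround along the lines proposed, staging through host memory so the old device buffer is freed before the larger one is allocated (gpuTensor and newSize are illustrative names):

local host = gpuTensor:float()     -- device -> host copy of the current data
gpuTensor = nil                    -- drop the only reference to the device buffer
collectgarbage()                   -- force cutorch to free it now
gpuTensor = torch.CudaTensor(newSize)            -- allocate the larger buffer
gpuTensor:narrow(1, 1, host:size(1)):copy(host)  -- copy the old data back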

cuda convolutional autoencoder (SpatialConvolutionCUDA)

hi guys,
in the "koraykv/unsup" package there is a conv-psd autoencoder. I was trying to do something similar using cunn, but the SpatialConvolutionCUDA module only allows output feature maps that are multiples of 16, so I wasn't able to make a decoder. Does anybody know how to work around this?
thanks a lot

SpatialMaxPooling failure case

Fails for the case:

m=nn.SpatialMaxPooling(2,2,2,2):cuda()
input=torch.randn(128,1024,12,12):cuda()
m:forward(input)
error in SpatialMaxsampling.updateOutput: invalid argument
luajit: ...ch-distro/install/share/lua/5.1/nn/SpatialMaxPooling.lua:18: aborting

Nondeterministic behaviour in SpatialMaxPooling

I have encountered nondeterministic behavior in the backpropagation step of SpatialMaxPooling when kW~=dW or kH~=dH (i.e. when the atomicmaxgradinput kernel is run). I suspect the issue is caused by atomicAdd() and the general non-associativity of floats. Thus, I'm fairly sure this is a feature of parallelism, but I wanted to make you aware of it somehow (it's scary)...

The following code reproduces it on my machine (GTX Titan, CUDA driver 346.46) when the GPU is under some load. If I let the computation run on the CPU with FloatTensors, it's deterministic.

local inp = torch.CudaTensor(1,18,18):zero()
inp[1][3][3] = 1 --will force adding up 3 numbers

fw = torch.CudaTensor(1,8,8):fill(0)
fw[1][2][1] = -0.00055786536540836
fw[1][2][2] = 0.00075417151674628
fw[1][1][2] = -0.00029314149287529

local model = nn.SpatialMaxPooling(3, 3, 2, 2):cuda()    
model:forward(inp)
local bw = model:backward(inp, fw):clone()

for i=1,100 do
    local diff = bw - model:backward(inp, fw)
    print(diff:sum()) --sometimes, the input is not zero!
end

Note further that

local a = (-0.00055786536540836 + 0.00075417151674628) -0.00029314149287529
local b = 0.00075417151674628 + (-0.00029314149287529 -0.00055786536540836)
print(a-b)  -- not zero

On a slightly related note, is there any reason for calling atomicmaxgradinput instead of maxgradinput at SpatialMaxPooling.cu:320?

SpatialMaxPooling input data format bug

Hi, all~
Currently I'm looking for a Torch API with a stride mechanism for convolution and pooling, and I found nn.SpatialConvolutionCUDA and nn.SpatialMaxPoolingCUDA, ported from cuda-convnet.

The SpatialConvolutionCUDA takes a BHWD input format, while SpatialMaxPoolingCUDA does not.
I get an error when adding max-pooling after the convolution function.

static int cunn_SpatialMaxPoolingCUDA_updateOutput(lua_State *L)
{
  THCudaTensor *input = (THCudaTensor *)luaT_checkudata(L, 2, "torch.CudaTensor");
  int kW = luaT_getfieldcheckint(L, 1, "kW");
  int kH = luaT_getfieldcheckint(L, 1, "kH");
  int dW = luaT_getfieldcheckint(L, 1, "dW");
  int dH = luaT_getfieldcheckint(L, 1, "dH");

  THCudaTensor *output = (THCudaTensor *)luaT_getfieldcheckudata(L, 1, "output", "torch.CudaTensor");

  luaL_argcheck(L, input->nDimension == 4, 2, "4D (batch) tensor expected");

  long nInputCols = input->size[2];
  long nInputRows = input->size[1];
  long nInputPlane = input->size[0];
  long batchSize = input->size[3];
  long nOutputCols = (nInputCols - kW) / dW + 1;
  long nOutputRows = (nInputRows - kH) / dH + 1;

This code snippet should be revised like this:

 long batchSize = input->size[0]; 
 long nInputRows = input->size[1];
 long nInputCols = input->size[2];
 long  nInputPlane = input->size[3];

At the moment, SpatialMaxPoolingCUDA will only take effect if we change the output format from BHWD to DHWB, OMG.....
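
As a hedged workaround sketch (not the module's own API), the layouts can be glued together with explicit transposes. This assumes x is a standard BDHW CudaTensor and that the port indexes DHWB as in the C snippet above:

-- rearrange B,D,H,W -> D,H,W,B before feeding the cuda-convnet ports;
-- contiguous() materializes the permuted layout in memory
local dhwb = x:transpose(1,2):transpose(2,3):transpose(3,4):contiguous()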

ClassNLLCriterion cuda with weights is slow

If the batch size is large and you include per-class weights, then ClassNLLCriterion is SUPER slow. Every tensor dereference pulls data back to the CPU. In my case I have 3 classes, a batch size of 10,000 (yes, I know this is high), and forward + backward on the criterion is more than 100x slower than the network.

Now that I'm at Google I need to get permission to fix this, so it might take a while. But if someone has any spare cycles I would be extremely grateful if they can fix it :-)

mse criterion returns negative values when running in parallel on the same gpu

I'm using nn.MSECriterion and I've noticed that the following code returns a negative value in very rare cases:
local err = criterion:forward(output, targets)

It has happened to me only in very rare cases (3-4 times in the last year); in all cases I was training two networks at the same time on the same GPU.
Note that it happened after thousands of epochs, and it is very hard to reproduce.

If someone knows why it can happen or how to fix/avoid it - please let me know.

Thanks in advance,
Itzik.

cunn/SoftMax does not support images with 2^16 or more elements

The nn version works

model = nn.SoftMax()
x = torch.randn(1, 2, 65536, 1)
y = model:forward(x)

however the cunn version fails

model = model:cuda()
x = x:cuda()
y = model:forward(x)

with error

nngraph/gmodule.lua:281: invalid configuration argument at [...]/torch/extra/cunn/SoftMax.cu:153

This error does not occur when the image contains 65535 elements.

Bug in Cuda version of nn.PReLU

Hi,

There seems to be a small bug in the cuda version of nn.PReLU. Basically the batch and data axes seem to be confused. CPU version works well. Here is an isolated example:

th> prelu = nn.PReLU(3)

th> prelu.weight = torch.FloatTensor({2, 4, 6})

th> test = torch.ones(2, 3):mul(-1)

th> test
-1 -1 -1
-1 -1 -1
[torch.FloatTensor of size 2x3]

th> prelu:forward(test) -- OK
-2 -4 -6
-2 -4 -6
[torch.FloatTensor of size 2x3]

th> prelu:cuda():forward(test:cuda()) -- bug
-2 -2 -4
-4 -6 -6
[torch.CudaTensor of size 2x3]

Should I fix it by submitting a patch to PReLU.cu?

Size mismatch errors ignored: THCudaTensor_pointwiseApply* retvals need checking

None of the cunn modules that use THCudaTensor_pointwiseApply2 or THCudaTensor_pointwiseApply3 check the retvals, and so size inconsistency errors are silently ignored.

The functions return false if the tensor has too many dimensions, or if the sizes of the tensors mismatch.

I think there was a reason at the time why the pointwise functions did not throw Lua errors (I think it was because the caller had more contextual information about what the error should be, instead of a non-descriptive size mismatch). Either the retvals should be checked in cunn, or we should add Lua errors in cutorch's pointwiseApply*.

https://github.com/torch/cunn/search?utf8=%E2%9C%93&q=THCudaTensor_pointwiseApply2&type=Code
https://github.com/torch/cunn/search?utf8=%E2%9C%93&q=THCudaTensor_pointwiseApply3&type=Code

> require 'cunn'
true
                                                                      [2.1727s]
> cpuModule = nn.ReLU(false)
                                                                      [0.0000s]
> gpuModule = nn.ReLU(false):cuda()
                                                                      [0.0002s]
> cpuInput = torch.DoubleTensor(8):fill(-1)
                                                                      [0.0000s]
> cpuGradOutput = torch.DoubleTensor(9):uniform()
                                                                      [0.0000s]
> gpuInput = cpuInput:cuda()
                                                                      [0.1500s]
> gpuGradOutput = cpuGradOutput:cuda()
                                                                      [0.0003s]
> cpuModule:updateOutput(cpuInput)                                                                                                          
 0
 0
 0
 0
 0
 0
 0
 0
[torch.DoubleTensor of size 8]

> cpuModule:updateGradInput(cpuInput, cpuGradOutput)
...plearning/torch/cuth.llar.linktree/_lua/nn/Threshold.lua:26: inconsistent tensor size at torch/oss/nn/generic/Threshold.c:48
stack traceback:
        ...learning/torch/cuth.llar.linktree/_lua/fb/util/error.lua:76: in function <...learning/torch/cuth.llar.linktree/_lua/fb/util/error
.lua:72>
        [C]: in function 'Threshold_updateGradInput'
        ...plearning/torch/cuth.llar.linktree/_lua/nn/Threshold.lua:26: in function 'updateGradInput'
        [string "_RESULT={cpuModule:updateGradInput(cpuInput, ..."]:1: in function 'inner_func'
        ...learning/torch/cuth.llar.linktree/_lua/fb/trepl/init.lua:492: in function <...learning/torch/cuth.llar.linktree/_lua/fb/trepl/ini
t.lua:492>
...

> gpuModule:updateOutput(gpuInput)
 0
 0
 0
 0
 0
 0
 0
 0
[torch.CudaTensor of size 8]

                                                                      [0.0006s]
> gpuModule:updateGradInput(gpuInput, gpuGradOutput)
 0
 0
 0
 0
 0
 0
 0
 0
[torch.CudaTensor of size 8]

                                                                      [0.0004s]

Non-deterministic LookupTable

The backward pass of the LookupTable seems to be non-deterministic on the GPU, and different from the CPU implementation. Is this expected?

require 'torch'                                                                 
require 'cutorch'                                                               
require 'nn'                                                                    
require 'cunn'                                                                  

do                                                                              
  local lt = nn.LookupTable(4096, 256):cuda()                                   
  lt.weight:fill(1)                                                             
  lt.gradWeight:fill(0)                                                         
  local input = torch.CudaTensor(4000):fill(1)                                  
  lt:forward(input)                                                             
  print(lt.output:sum())                                                        
  lt:backward(input, lt.output)                                                 
  print(lt.gradWeight:sum())                                                    
end                                                                             

do                                                                              
  local lt = nn.LookupTable(4096, 256)                                          
  lt.weight:fill(1)                                                             
  lt.gradWeight:fill(0)                                                         
  local input = torch.DoubleTensor(4000):fill(1)                                
  lt:forward(input)                                                             
  print(lt.output:sum())                                                        
  lt:backward(input, lt.output)                                                 
  print(lt.gradWeight:sum())                                                    
end

Output:

1024000
641312     -- this varies
1024000
1024000

Discrepancy in output dimensions when using VolumetricConvolution in CUDA vs the normal CPU version

Running VolumetricConvolution in CUDA mode seems to alter the dimensions of the output tensor.
For example, I have an input batch of 5D as follows:

input = torch.Tensor(20, 2, 32, 64, 64)

I only have one layer as follows:

model:add(nn.VolumetricConvolution(2, 16, 1, 5, 5))

In CPU mode, doing a forward pass produces the correct output of dimension (20, 16, 32, 60, 60).
But in CUDA the dimensions get swapped and I get the output dimension as (20, 16, 60, 32, 60), which shouldn't be the case.

Ideally, both should produce the same output.
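
Until the kernel is fixed, a hedged workaround sketch is to swap the two offending axes of the CUDA output so its sizes match the CPU layout. This is only safe if the underlying data really is transposed this way, which is worth verifying against the CPU result first:

local out = model:forward(input:cuda())  -- size 20x16x60x32x60, as reported
out = out:transpose(3, 4):contiguous()   -- -> 20x16x32x60x60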

Padding is eaten away

I'm playing with float and cuda networks, and now I get this:

th> model.modules[1].padding = 0
                                                                      [0.0001s]
th> model.modules[1].padding                                                                                                           
0
                                                                      [0.0001s]
th> model.modules[1]:forward(torch.CudaTensor(3,224,244))                                                                              
[string "_RESULT={model.modules[1]:forward(torch.CudaT..."]:1: bad argument #1 (field padding does not exist)
stack traceback:
        [C]: in function 'forward'
        [string "_RESULT={model.modules[1]:forward(torch.CudaT..."]:1: in main chunk
        [C]: in function 'xpcall'
        /usr/local/share/lua/5.1/trepl/init.lua:630: in function 'repl'
        /usr/local/lib/luarocks/rocks/trepl/scm-1/bin/th:185: in main chunk
        [C]: at 0x00406260
                                                                      [0.0008s]
th> model.modules[1].padding                                                                                                           
                                                                      [0.0000s]
th> 

How can I get around this?

VolumetricConvolution tests fail

Ubuntu, all VolumetricConvolution tests fail with different outputs.
examples:
1)

> nn.testcuda()
Running 81 tests
___|_____________________________________________________________________________  ==> VolumetricConvolution_forward_batch*** Error in `/usr/local/bin/luajit': double free or corruption (!prev): 0x0000000011762b50 ***
Aborted (core dumped)
> nn.testcuda()
Running 81 tests
___|_____________________________________________________________________________  ==> VolumetricConvolution_forward_batch$ Invalid argument 4: out of range
$ Invalid argument 2: out of range
$ Invalid argument 2: out of range
$ Invalid argument 2: out of range
Segmentation fault (core dumped)
> nn.testcuda()
Running 81 tests
___|_____________________________________________________________________________  ==> VolumetricConvolution_forward_batch$ Invalid argument 2: invalid number of input planes
$ Invalid argument 2: out of range
$ Invalid argument 2: out of range
$ Invalid argument 2: out of range
$ Invalid argument 2: out of range
$ Invalid argument 2: out of range
$ Invalid argument 2: out of range
$ Invalid argument 2: out of range
$ Invalid argument 2: out of range
$ Invalid argument 2: out of range
Segmentation fault (core dumped)

SoftMax totally wrong

require 'cunn'

values = {5.4334, 3.5017, 1.1042, -4.9523, 1.7309, 2.4248, 3.0710, -2.9635, 2.5460, -1.7663, 0.5645, 12.3819, -3.1641, -3.6385, 0.5742, -1.3508, -2.2765}

floatInput = torch.FloatTensor(values)
floatSoftMax = nn.SoftMax():float()

cudaInput = torch.CudaTensor(values)
cudaSoftMax = nn.SoftMax():cuda()

print(floatSoftMax:forward(floatInput))
print(cudaSoftMax:forward(cudaInput))

returns

atcold@elab-GPU1 ~ $ th bug.lua 
 0.0010
 0.0001
 0.0000
 0.0000
 0.0000
 0.0001
 0.0001
 0.0000
 0.0001
 0.0000
 0.0000
 0.9986
 0.0000
 0.0000
 0.0000
 0.0000
 0.0000
[torch.FloatTensor of dimension 17]

 9.5879e-04
 1.3893e-04
 1.2635e-05
 2.9598e-08
 2.3645e-05
 4.7326e-05
 9.0312e-05
 2.1627e-07
 5.3424e-05
 7.1603e-07
 7.3652e-06
 9.9866e-01
 1.7696e-07
 1.1011e-07
 7.4370e-06
 1.0849e-06
 4.2989e-07
[torch.CudaTensor of dimension 17]

TemporalMaxPooling breaks with >1024 features

If the output frame size of a cunn TemporalMaxPooling layer is greater than 1024, the output is no longer correct. Compare the CPU and GPU output of the code below:

require('torch')
require('cutorch')
require('cunn')

-- ConvNet parameters
kW = 20; dW = 1; maxPool = 4;
numFeatures1 = 20


print("Loading data...")
docData = torch.randn(4200, 530) --this data is transposed from the usual data format
nSamples = docData:size()[1]
nExamples = docData:size()[2]
n = nSamples


print("Creating neural network...")

-- ConvNet
n1 = (n-kW)/dW+1
n2 = torch.floor(n1/maxPool)

cpuFeatsNet = nn.Sequential()
tConvLayer = nn.TemporalConvolution(1, numFeatures1, kW)
cpuFeatsNet:add(tConvLayer)
cpuFeatsNet:add(nn.TemporalMaxPooling(maxPool))

gpuFeatsNet = nn.Sequential()
tConvLayerGpu = nn.TemporalConvolution(1, numFeatures1, kW)
gpuFeatsNet:add(tConvLayerGpu)
gpuFeatsNet:add(nn.TemporalMaxPooling(maxPool))
gpuFeatsNet:cuda()

batchSize = 530
numProcSamples = 530

tConvLayer.bias:zero()
tConvLayerGpu.bias:zero()
tConvLayerGpu.weight[{}] = tConvLayer.weight

function saveNewFeatures(krnToSave, procUnit)
  procUnit = procUnit or "cpu"

  if procUnit == "cpu" then
    featsNet = cpuFeatsNet
    print("Processing on the CPU...")
  elseif procUnit == "gpu" then
    featsNet = gpuFeatsNet
    cudaInp = torch.CudaTensor(batchSize,nSamples,1)
    print("Processing on the GPU...")
  end

  local feats = torch.Tensor(numProcSamples, n2)

  for i=1,numProcSamples,batchSize do
    tIdx = i
    inp = docData[{{1,nSamples},{tIdx,tIdx+batchSize-1}}]:t():reshape(batchSize,nSamples,1):double()

    if procUnit == "cpu" then
      feats[{{i,i+batchSize-1},{}}] = featsNet:forward(inp)[{{},{},krnToSave}]
    elseif procUnit == "gpu" then
      cudaInp[{}] = inp --use same cuda memory instead of allocating new
      cudaOut = featsNet:forward(cudaInp)[{{},{},krnToSave}]
      feats[{{i,i+batchSize-1},{}}] = cudaOut:double()
    end
  end

  return feats:t()
end

cpuFeats = saveNewFeatures(15, 'cpu')
gpuFeats = saveNewFeatures(15, 'gpu')
diff = cpuFeats-gpuFeats

print("Here is the difference between the output of a CPU net and a GPU net, for an arbitrary feature (15 in this case) from index 1020 to 1030. Note the errors beginning at index 1025.")
print(diff[{{1020,1030},1}])

nn.testcuda() produces unstable results on Yosemite 10.10.1 with CUDA 6.5

Getting unstable results with nn.testcuda(): it sometimes passes, sometimes fails, and sometimes segfaults. I ran the tests because of a CPU->GPU results discrepancy for identical scripts and data.

Running Macbook Pro Retina 10,1 (mid 2012).
Yosemite 10.10.1, CUDA 6.5 - latest drivers and libs as of 12/1/14.
Re-installed today - as part of ongoing effort to solve cpu/gpu discrepancies - described at end.
Latest Torch7 install, using the '2 line' scripts from Torch.ch. Used Clang 6.0, as CUDA 6.5 is incompatible with gcc49. I am not clear how the scripts deal with libstdc++ issues.
Ran the dependencies script as a normal admin user.
Ran the luajit-torch script using sudo -s.
This fails to build a loadable cunn properly. The local build fix did not work, due to cmake 3.0.2 changes in rpath handling. Editing FindCUDA.cmake as recommended produced a loadable libcunn.so.

The attached terminal session shows a common failure mode. Repeated testing shows passes, passes with significant delays, and failures ranging from a single failing test, to segfaults, to out of memory.

testcuda fails

Some background: I have been struggling for 2-3 weeks trying to get CPU and GPU results to match. I have reinstalled all Torch components, as well as CUDA, numerous times, and did experiments with setting manualSeed(). I found that each platform produced repeatable results, but none of them matched. This is CPU and GPU on OSX, Ubuntu 14.04, and CentOS 6.6. Timing differences with and without the GPU are also inconsistent. It feels like this could be some kind of install issue, but after having built the environment from scratch numerous times, I am in the dark as to what it might be.

cunn installation failing with malformed object

Installation of cunn on Mac OS X (Yosemite) fails with the following error:

Installing https://raw.githubusercontent.com/torch/rocks/master/cunn-scm-1.rockspec...
Using https://raw.githubusercontent.com/torch/rocks/master/cunn-scm-1.rockspec... switching to 'build' mode
Cloning into 'cunn'...
remote: Counting objects: 50, done.
remote: Compressing objects: 100% (37/37), done.
remote: Total 50 (delta 19), reused 35 (delta 12)
Receiving objects: 100% (50/50), 75.57 KiB | 0 bytes/s, done.
Resolving deltas: 100% (19/19), done.
Checking connectivity... done.
cmake -E make_directory build && cd build && cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_PREFIX_PATH="/usr/local/bin/.." -DCMAKE_INSTALL_PREFIX="/usr/local/lib/luarocks/rocks/cunn/scm-1" && make

-- The C compiler identification is GNU 4.9.1
-- The CXX compiler identification is GNU 4.9.1
-- Checking whether C compiler has -isysroot
-- Checking whether C compiler has -isysroot - yes
-- Checking whether C compiler supports OSX deployment target flag
-- Checking whether C compiler supports OSX deployment target flag - yes
-- Check for working C compiler: /usr/local/bin/gcc-4.9
-- Check for working C compiler: /usr/local/bin/gcc-4.9 -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Checking whether CXX compiler has -isysroot
-- Checking whether CXX compiler has -isysroot - yes
-- Checking whether CXX compiler supports OSX deployment target flag
-- Checking whether CXX compiler supports OSX deployment target flag - yes
-- Check for working CXX compiler: /usr/local/bin/g++-4.9
-- Check for working CXX compiler: /usr/local/bin/g++-4.9 -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Found Torch7 in /usr/local
-- Found CUDA: /Developer/NVIDIA/CUDA-6.5 (Required is at least version "4.0")
-- Configuring done
-- Generating done
-- Build files have been written to: /tmp/luarocks_cunn-scm-1-6051/cunn/build
[100%] Building NVCC (Device) object CMakeFiles/cunn.dir//./cunn_generated_init.cu.o
Scanning dependencies of target cunn
Linking CXX shared module libcunn.so
[100%] Built target cunn
cd build && make install
[100%] Built target cunn
Install the project...
-- Install configuration: "Release"
-- Installing: /usr/local/lib/luarocks/rocks/cunn/scm-1/lib/libcunn.so
/opt/local/bin/install_name_tool: object: /usr/local/lib/luarocks/rocks/cunn/scm-1/lib/libcunn.so malformed object (load command 23 cmdsize is zero)
/opt/local/bin/install_name_tool: object: /usr/local/lib/luarocks/rocks/cunn/scm-1/lib/libcunn.so malformed object (load command 23 cmdsize is zero)
-- Installing: /usr/local/lib/luarocks/rocks/cunn/scm-1/lua/cunn/init.lua
-- Installing: /usr/local/lib/luarocks/rocks/cunn/scm-1/lua/cunn/test.lua
Updating manifest for /usr/local/lib/luarocks/rocks
cunn scm-1 is now built and installed in /usr/local/ (license: BSD)

Using cunn also fails with a similar error. There was another issue (torch/cutorch#66) that reported the same, but it was closed without any resolution or help on how to resolve it.

cannot install cunn

Let me start by saying I am new to Torch7 and Lua. I have Torch7 installed under Ubuntu 14.04 and have successfully installed the nn package. Now I am trying to install the cunn package using:
luarocks install cunn

The following is the output of my install:
kzachery@DELL:~$ luarocks install cunn
Installing https://raw.githubusercontent.com/torch/rocks/master/cunn-scm-1.rockspec...
Using https://raw.githubusercontent.com/torch/rocks/master/cunn-scm-1.rockspec... switching to 'build' mode
Cloning into 'cunn'...
remote: Counting objects: 47, done.
remote: Compressing objects: 100% (28/28), done.
remote: Total 47 (delta 18), reused 30 (delta 16), pack-reused 0
Receiving objects: 100% (47/47), 49.04 KiB | 0 bytes/s, done.
Resolving deltas: 100% (18/18), done.
Checking connectivity... done.
cmake -E make_directory build && cd build && cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_PREFIX_PATH="/home/kzachery/torch/install/bin/.." -DCMAKE_INSTALL_PREFIX="/home/kzachery/torch/install/lib/luarocks/rocks/cunn/scm-1" && make -j$(getconf _NPROCESSORS_ONLN) install

-- The C compiler identification is GNU 4.8.2
-- The CXX compiler identification is GNU 4.8.2
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Found Torch7 in /home/kzachery/torch/install
-- Found CUDA: /usr/local/cuda (found suitable version "7.0", minimum required is "4.0")
-- Configuring done
-- Generating done
-- Build files have been written to: /tmp/luarocks_cunn-scm-1-9119/cunn/build
[100%] Building NVCC (Device) object CMakeFiles/cunn.dir//./cunn_generated_init.cu.o
In file included from /tmp/luarocks_cunn-scm-1-9119/cunn/init.cu:14:0:
/tmp/luarocks_cunn-scm-1-9119/cunn/HardTanh.cu:2:24: fatal error: THCApply.cuh: No such file or directory
#include "THCApply.cuh"
^
compilation terminated.
CMake Error at cunn_generated_init.cu.o.cmake:206 (message):
Error generating
/tmp/luarocks_cunn-scm-1-9119/cunn/build/CMakeFiles/cunn.dir//./cunn_generated_init.cu.o

make[2]: *** [CMakeFiles/cunn.dir/./cunn_generated_init.cu.o] Error 1
make[1]: *** [CMakeFiles/cunn.dir/all] Error 2
make: *** [all] Error 2

Error: Build error: Failed building.

What am I missing in order to get this package installed? Any help you could provide would be greatly appreciated.

C++ exception with nn.Tanh layer

I managed to install cutorch and cunn on OS X 10.9.4 with CUDA 6.5 and cmake 3.0.1, with the small changes described in ticket 27 of the cutorch repository.

However, I now have some problems running the forward method when my neural network includes an nn.Tanh layer. It works with an nn.Linear layer, but if I add the hyperbolic tangent I get a C++ exception.

Is there a way to see more details about the root cause of the exception? In the "th" interactive tool I don't get further details, just a "C++ Exception" error message.

Did this happen to anyone else?

Thanks

SpatialFullConvolution test flaky

The SpatialFullConvolution tests sometimes fail due to invalid configurations:

SpatialFullConvolution_forward_batch
 Function call failed 
.../cunn/test.lua:785: bad argument #1 to 'forward' (3D or 4D (batch mode) tensor is expected)

It can be reproduced when running

torch -lcunn -e "nn.testcuda({'SpatialFullConvolution_forward_batch','SpatialFullConvolution_backward_batch'}, false, 1, 1449228794)"

The input sizes don't make sense: 16,14, 0.33333333333333, 2.3333333333333

I've only observed this in cunn, but I suspect the same could happen in nn.

Sanitizing conv modules

Hey all,

I've been testing SpatialConvolutionMM quite a bit, and as with the CPU version at the time, it's replacing all other conv modules for me at this point. Questions:

  • can we get rid of SpatialConvolution, for which the perf is ridiculously low
  • can we get rid of SpatialConvolutionMap, which is half-implemented (and not sure anybody uses this type of module)
  • can we get rid of the special SpatialConvolutionCUDA and SpatialMaxPoolingCUDA modules, which were only here temporarily (Alex's kernels)

I vote yes on the first 2 questions; OK to keep the 3rd alive for a bit longer, since people might depend on it.

SpatialMaxsampling: too many resources requested for launch

I'm running a very tiny network (a 3-layer ConvNet with a few tens of filters) on an Nvidia Jetson TK1 with Ubuntu 14.04, CUDA 6.0 and the latest drivers (R19.3.0_armhf).
If I use the CPU, everything works fine and the feedforward step completes in tens of ms.
If I try to use the GPU (using nn.SpatialConvolutionMM and nn.SpatialMaxPooling), I get the following error message, which I do not understand:

error in SpatialMaxsampling.updateOutput: too many resources requested for launch
/usr/local/bin/luajit: /usr/local/share/lua/5.1/nn/SpatialMaxPooling.lua:18: aborting   
stack traceback:                                                                        
        [C]: in function 'SpatialMaxPooling_updateOutput'                               
        /usr/local/share/lua/5.1/nn/SpatialMaxPooling.lua:18: in function 'updateOutput'
        /usr/local/share/lua/5.1/nn/Sequential.lua:37: in function 'forward'            
        ./src/profileNet.lua:32: in function 'time'                                     
        general-profiler.lua:33: in main chunk                                          
        [C]: in function 'dofile'                                                       
        /usr/local/lib/luarocks/rocks/trepl/scm-1/bin/th:129: in main chunk             
        [C]: at 0x0000cf89                                                              

Running the same code on an Intel machine with Ubuntu 14.04 and a GeForce GTX 780, this doesn't happen. A similar error, though, happened on another Intel machine with Ubuntu 14.04 and a Tesla K40c while trying to run OverFeat.

Moreover, this error does not happen if I load the model and run the SpatialMaxPooling layer on its own. But if I run all the previous layers one by one, it fails at the SpatialMaxPooling layer with the "too many resources requested for launch" CUDA error message.

I believe it has something to do with the registers available per block. Perhaps the TK1 has more limited resources and SpatialMaxsampling is trying to allocate too many registers.

problem with installation

Trying to install cunn, I run into the following problem (error at the bottom). Any ideas how to fix it?

root@yitzhak-VirtualBox:/media/sf_Dima/Projects_Torch/examples_from_github/eladhoffer/ConvNet-torch-master# luarocks install cunn
Installing https://raw.githubusercontent.com/torch/rocks/master/cunn-scm-1.rockspec...
Using https://raw.githubusercontent.com/torch/rocks/master/cunn-scm-1.rockspec... switching to 'build' mode

Missing dependencies for cunn:
cutorch >= 1.0

Using https://raw.githubusercontent.com/torch/rocks/master/cutorch-scm-1.rockspec... switching to 'build' mode
Cloning into 'cutorch'...
remote: Counting objects: 75, done.
remote: Compressing objects: 100% (71/71), done.
remote: Total 75 (delta 8), reused 22 (delta 1), pack-reused 0
Receiving objects: 100% (75/75), 106.22 KiB | 181.00 KiB/s, done.
Resolving deltas: 100% (8/8), done.
Checking connectivity... done.
cmake -E make_directory build && cd build && cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_PREFIX_PATH="/home/yitzhak/torch/install/bin/.." -DCMAKE_INSTALL_PREFIX="/home/yitzhak/torch/install/lib/luarocks/rocks/cutorch/scm-1" && make -j$(getconf _NPROCESSORS_ONLN) install

-- The C compiler identification is GNU 4.8.2
-- The CXX compiler identification is GNU 4.8.2
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Found Torch7 in /home/yitzhak/torch/install
CMake Error at /usr/share/cmake-2.8/Modules/FindCUDA.cmake:548 (message):
Specify CUDA_TOOLKIT_ROOT_DIR
Call Stack (most recent call first):
CMakeLists.txt:7 (FIND_PACKAGE)

-- Configuring incomplete, errors occurred!
See also "/tmp/luarocks_cutorch-scm-1-1323/cutorch/build/CMakeFiles/CMakeOutput.log".

Error: Failed installing dependency: https://raw.githubusercontent.com/torch/rocks/master/cutorch-scm-1.rockspec - Build error: Failed building.

SpatialConvolution GPU vs CPU

Hi,

Can someone explain to me why this fails:

require 'cunn'
mytester = torch.Tester()
tests = {}
function tests.test() 
   local input = torch.randn(3,32,32)
   local cnn = nn.Sequential()
   cnn:add(nn.SpatialConvolution(3,8,5,5))
   cnn:add(nn.ReLU())
   cnn:add(nn.SpatialAveragePooling(2,2,2,2))
   cnn:add(nn.SpatialConvolution(8,12,5,5))
   cnn:add(nn.ReLU())
   cnn:add(nn.SpatialAveragePooling(2,2,2,2))
   local outsize = cnn:forward(input):size()
   cnn:add(nn.Reshape(outsize[1]*outsize[2]*outsize[3]))
   cnn:add(nn.Linear(outsize[1]*outsize[2]*outsize[3],20))
   cnn:add(nn.ReLU())
   cnn:add(nn.Linear(20,10))
   local output = cnn:forward(input):clone()
   local gradOutput = output:clone()
   local gradInput = cnn:backward(input, gradOutput):clone()
   cnn:float()
   local input3 = input:float()
   local output3 = cnn:forward(input3):clone()
   local gradOutput3 = output3:clone()
   local gradInput3 = cnn:backward(input3, gradOutput3):clone()
   mytester:assertTensorEq(output3:float(), output:float(), 0.000001, "type float fwd err")
   mytester:assertTensorEq(gradInput3:float(), gradInput:float(), 0.00001, "type float bwd err") 
   cnn:cuda()
   local input2 = input3:cuda()
   local gradOutput2 = gradOutput3:cuda()
   local output2 = cnn:forward(input2)
   local gradInput2 = cnn:backward(input2, gradOutput2)
   print(gradInput2[1][1], gradInput[1][1])
   mytester:assertTensorEq(output2:float(), output3, 0.000001, "type cuda fwd err")
   mytester:assertTensorEq(gradInput2:float(), gradInput3, 0.00001, "type cuda bwd err") 
end
mytester:add(tests)
mytester:run()

Yet when I comment out the SpatialConvolution lines, it passes. What is the difference in behavior?

nn.LookupTable() doesn't generate an "out of range" error when run on cuda

I encountered this issue while working on a word-embedding extraction project. If we try to get the embedding at an out-of-range index while running on the CPU, it throws an exception. However, when running on the GPU it returns garbage values. See the code chunks below:
On CPU:
ll = nn.LookupTable(5,6)
ll:forward(torch.Tensor(1):fill(100))
result:
index out of range

On GPU:
ll = nn.LookupTable(5,6):cuda()
ll:forward(torch.Tensor(1):fill(100))
result:
nan -2417556347208890920730624.000000 nan 41205700250658325873491584824761122816.000000 nan -0.000000
[torch.CudaTensor of size 1x6]

Any insights about what this value might mean?
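
Until the CUDA path validates indices, one defensive workaround is to check the bounds on the CPU side before the lookup. A sketch, with the bound mirroring the 5-row table above:

local input = torch.Tensor(1):fill(100)
assert(input:min() >= 1 and input:max() <= ll.weight:size(1),
       'LookupTable index out of range')
ll:forward(input:cuda())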

CUDA convolution example

Hi,

can you point me to an example application of the CUDA convolution modules? A repo with code that makes use of them would work. If not, that's ok.

--Nick

Test failed with th -lcunn -e "nn.testcuda()"

When I run test.sh, th -lcunn -e "nn.testcuda()" is unstable and occasionally fails with different error messages:

Each time the error message is different; some examples are:

SpatialSubSampling_backward
error on state (backward)
LT(<) violation val=1.2421855926514, condition=0.01
/root/torch/install/share/lua/5.1/torch/Tester.lua:26: in function 'assertlt'
/root/torch/install/share/lua/5.1/cunn/test.lua:1391: in function 'v'

LogSoftMax_forward_batch
error on state (forward)
LT(<) violation val=0.0010080337524414, condition=0.001
/root/torch/install/share/lua/5.1/torch/Tester.lua:26: in function 'assertlt'
/root/torch/install/share/lua/5.1/cunn/test.lua:2364: in function 'v'

I guess a similar issue was raised in #50 and solved(maybe?).

I updated to the latest torch and packages.

I use a ubuntu 14.04 docker image and cuda 7.0

Thanks.

require 'cunn' throws error

It seems to me that the file init.lua has changed recently and this change causes an error.
Line 9, "nn.Module._flattenTensorBuffer['torch.CudaTensor'] = torch.FloatTensor.new",
throws an error if
"require 'cunn'"
is used.

In particular I get the following error (if I don't remove the line):


.../torch/install/share/lua/5.1/cunn/init.lua:9: attempt to index field '_flattenTensorBuffer' (a nil value)
stack traceback:
.../torch/install/share/lua/5.1/cunn/init.lua:9: in main chunk
[C]: in function 'require'
stdin:1: in main chunk
[C]: at 0x00406670


Thanks for your help!
