odlcuda's People

Contributors

adler-j, kohr-h, larswestergren, niinimaki

odlcuda's Issues

Remove pyinstall

We should remove the custom pyinstall target and instead use the built-in install support in CMake. This is used in our STIR clone.

installation with local odl

I managed to install odlcuda with the odl that comes from pip. However, that version is outdated, so I would like to use odlcuda with the latest odl version. Following exactly the same installation steps as with the off-the-shelf odl version, odl cannot find 'cuda' as an implementation, e.g.

NotImplementedError: no corresponding data space available for space FunctionSpace(IntervalProd([-333.8016, -333.8016, 0. ], [ 333.8016 , 333.8016 , 257.96875]), out_dtype='float32') and implementation 'cuda'

This is all strange, because the installation does find my odl version, and that odl version works perfectly fine without cuda. Also, my system finds odlcuda, as it shows up in auto-completion after import odl. Any ideas what might have gone wrong here? What is the mechanism that tells odl that odlcuda is present?
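For what it's worth, odl typically discovers optional backends through setuptools entry points registered at install time. A minimal sketch of such a lookup follows; the group name 'odl.space.entry_points' is an assumption taken from the module name in the circular-dependency issue below, so check odl's and odlcuda's setup.py for the actual group:

```python
import pkg_resources

def discover_backends(group):
    # Every installed package that registered an entry point under this group
    # shows up here; if odlcuda's entry point is missing or was installed into
    # a different environment, odl never learns that 'cuda' exists.
    return {ep.name: ep.module_name
            for ep in pkg_resources.iter_entry_points(group)}

backends = discover_backends('odl.space.entry_points')
```

If `backends` comes up empty on a machine where odlcuda is installed, the two packages are likely in different site-packages directories.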

Better error with wrong GCC

Currently, using mismatched GCC and CUDA versions (in particular GCC 5.x with CUDA < 8) produces cryptic errors. Perhaps we should check for this combination to help users.

Errors when installing odlcuda with GCC.

We have installed odl and run pytest with no errors. But when we tried to install odlcuda using the command "CUDA_ROOT=/usr/local/cuda-10.2 CUDA_COMPUTE=75 conda build ./conda", the following error occurred:
"conda_build.exceptions.DependencyNeedsBuildingError: Unsatisfiable dependencies for platform linux-64: {"gcc[version='<5']"}".
I wonder what the problem is and how to solve it. Thanks a lot.

using odlcuda

This is kind of a continuation of issue odlgroup/odl#1074.

A few observations:

  1. I can import odlcuda. The import also works without _install_location = __file__, but in the following I left it in.

  2. The order matters. I first tried

import odlcuda
import odl

which causes odl not to know 'cuda' but

import odl
import odlcuda

works!

  3. This is not really related to CUDA but still weird: I tried to test timings similar to my application.

domain_cpu = odl.uniform_discr([0], [1], [3e+8], impl='numpy')

and failed.

Traceback (most recent call last):

File "", line 4, in
domain_cpu = odl.uniform_discr([0], [1], [3e+8], impl='numpy')

File "/mhome/damtp/s/me404/store/repositories/git_ODL/odl/discr/lp_discr.py", line 1311, in uniform_discr
**kwargs)

File "/mhome/damtp/s/me404/store/repositories/git_ODL/odl/discr/lp_discr.py", line 1222, in uniform_discr_fromintv
**kwargs)

File "/mhome/damtp/s/me404/store/repositories/git_ODL/odl/discr/lp_discr.py", line 1136, in uniform_discr_fromspace
nodes_on_bdry)

File "/mhome/damtp/s/me404/store/repositories/git_ODL/odl/discr/partition.py", line 940, in uniform_partition_fromintv
grid = uniform_grid_fromintv(intv_prod, shape, nodes_on_bdry=nodes_on_bdry)

File "/mhome/damtp/s/me404/store/repositories/git_ODL/odl/discr/grid.py", line 1092, in uniform_grid_fromintv
shape = normalized_scalar_param_list(shape, intv_prod.ndim, safe_int_conv)

File "/mhome/damtp/s/me404/store/repositories/git_ODL/odl/util/normalize.py", line 149, in normalized_scalar_param_list
out_list.append(param_conv(p))

File "/mhome/damtp/s/me404/store/repositories/git_ODL/odl/util/normalize.py", line 396, in safe_int_conv
raise ValueError('cannot safely convert {} to integer'.format(number))

ValueError: cannot safely convert 300000000.0 to integer
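The failure above comes from the strict integer conversion in odl.util.normalize. Below is a minimal reconstruction of that check (a sketch inferred from the traceback; odl's actual implementation may differ): float shape entries are rejected even when their value is integral, so passing int(3e+8) instead of 3e+8 avoids the error.

```python
import numpy as np

def safe_int_conv(number):
    # Reject anything that cannot be cast to int under numpy's 'safe' casting
    # rule. 300000000.0 is integral in value but still a float, and float ->
    # int is not a 'safe' cast, matching the ValueError in the traceback above.
    try:
        return int(np.array(number).astype(int, casting='safe'))
    except TypeError:
        raise ValueError('cannot safely convert {} to integer'.format(number))

safe_int_conv(int(3e+8))  # fine: the input is already an integer
# safe_int_conv(3e+8)     # raises ValueError
```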

  4. Now the CUDA issue: instead of 1D, I went 3D:

domain_gpu = odl.uniform_discr([0, 0, 0], [1, 1, 1], [4000, 300, 400], impl='cuda')
x_gpu = domain_gpu.one()

error:

Traceback (most recent call last):

File "", line 4, in
x_gpu = domain_gpu.one()

File "/mhome/damtp/s/me404/store/repositories/git_ODL/odl/discr/discretization.py", line 473, in one
return self.element_type(self, self.dspace.one())

File "/home/me404/.local/lib/python2.7/site-packages/odlcuda-0.5.0-py2.7.egg/odlcuda/cu_ntuples.py", line 912, in one
return self.element_type(self, self._vector_impl(self.size, 1))

RuntimeError: function_attributes(): after cudaFuncGetAttributes: invalid device function

Any idea what is wrong here?

nd arrays

We currently only support 1d arrays. I'll look into how we could improve this, either with true nd-array support or at least 2d and 3d support.
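As a stopgap, nd semantics can be emulated on top of flat 1d storage by carrying the shape separately and mapping indices with row-major strides. A numpy sketch of the idea (class and attribute names are illustrative, not an API proposal):

```python
import numpy as np

class FlatNdArray(object):
    """Toy nd view over flat 1d storage, the way a 1d-only backend could fake it."""

    def __init__(self, shape, data=None):
        self.shape = tuple(shape)
        self.size = int(np.prod(self.shape))
        self.flat = (np.zeros(self.size) if data is None
                     else np.asarray(data, dtype=float).ravel())

    def __getitem__(self, index):
        # Map the nd index to a row-major offset into the flat buffer.
        return self.flat[np.ravel_multi_index(index, self.shape)]

a = FlatNdArray((2, 3), data=range(6))
a[(1, 2)]  # flat offset 1 * 3 + 2 = 5, so this returns 5.0
```

Elementwise ufuncs then work unchanged on the flat buffer; only indexing, slicing and reductions over axes need the shape information.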

performance of odlcuda

I was running some experiments, and there seems to be a performance issue with CUDA. However, I am not sure whether odlcuda is causing it or whether this is a general CUDA phenomenon. Below is the code I ran with its output. I don't think it is important to know what this code does, but if need be, I am happy to explain.

With CUDA, the run time scales roughly linearly with the number of subsets, even though each run executes approximately the same number of flops. The timings are fairly constant for numpy. The question is whether this is an issue with odlcuda or a general CUDA phenomenon. In all of these timings there is no copying to or from the GPU. I am aware that launching kernels incurs overhead, but I would not have thought it was so dramatic, in particular since the smallest subset still contains 262 x 65000 = 17 million elements.

Do you think that this is related to the way ODL is written or do you think this is a general CUDA thing?

for impl in ['cuda', 'numpy']:
    for nsubsets in [1, 4, 16]:
        shape = [4200 // nsubsets, 65000]
        print('impl:{}, shape:{}, nsubsets:{}'.format(impl, shape, nsubsets))
        Y = odl.ProductSpace(odl.uniform_discr([0, 0], shape, shape,
                                               impl=impl), nsubsets)
        data = Y.one()
        background = Y.one()
        f = src_odl.KullbackLeibler(Y, data, background)
        x = 2 * Y.one()
        %time fx = f(x)
    
    shape = [4200, 65000]
    print('impl:{}, shape:{}'.format(impl, shape))
    Y = odl.uniform_discr([0, 0], shape, shape, impl=impl)
    data = Y.one()
    background = Y.one()
    f = src_odl.KullbackLeibler(Y, data, background)
    x = 2 * Y.one()
    %time fx = f(x)

impl:cuda, shape:[4200, 65000], nsubsets:1
CPU times: user 88.2 ms, sys: 24 ms, total: 112 ms
Wall time: 114 ms

impl:cuda, shape:[1050, 65000], nsubsets:4
CPU times: user 325 ms, sys: 28.7 ms, total: 354 ms
Wall time: 361 ms

impl:cuda, shape:[262, 65000], nsubsets:16
CPU times: user 1.43 s, sys: 31.7 ms, total: 1.46 s
Wall time: 1.49 s

impl:cuda, shape:[4200, 65000]
CPU times: user 93.6 ms, sys: 20.2 ms, total: 114 ms
Wall time: 116 ms

impl:numpy, shape:[4200, 65000], nsubsets:1
CPU times: user 10.7 s, sys: 352 ms, total: 11.1 s
Wall time: 6.88 s

impl:numpy, shape:[1050, 65000], nsubsets:4
CPU times: user 13.4 s, sys: 562 ms, total: 14 s
Wall time: 6.9 s

impl:numpy, shape:[262, 65000], nsubsets:16
CPU times: user 24.7 s, sys: 1 s, total: 25.7 s
Wall time: 7.04 s

impl:numpy, shape:[4200, 65000]
CPU times: user 10.8 s, sys: 380 ms, total: 11.1 s
Wall time: 6.92 s
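One hedged reading of the numbers above: if each subset evaluation pays a roughly fixed cost (kernel launches plus synchronization), the total time grows linearly with the subset count while the arithmetic stays constant. A toy cost model, fitted by eye to the measurements, not derived from odlcuda internals:

```python
def predicted_time(nsubsets, per_subset_overhead, base_time):
    # total = fixed compute time for the whole data set
    #         + a constant per-subset cost (launches, sync, Python dispatch)
    return base_time + nsubsets * per_subset_overhead

# With ~0.09 s per subset and ~0.03 s of base compute, the model gives
# 0.12 s, 0.39 s and 1.47 s for 1, 4 and 16 subsets, close to the measured
# cuda wall times above (0.114 s, 0.361 s, 1.49 s).
estimates = [predicted_time(k, 0.09, 0.03) for k in (1, 4, 16)]
```

If this picture is right, the per-subset cost dominates long before the data gets small, which would point at dispatch overhead rather than the kernels themselves.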

for impl in ['cuda', 'numpy']:
    for nsubsets in [1, 4, 16]:
        shape = [4200 // nsubsets, 65000]
        print('impl:{}, shape:{}, nsubsets:{}'.format(impl, shape, nsubsets))
        Y = odl.ProductSpace(odl.uniform_discr([0, 0], shape, shape, impl=impl), nsubsets)
        data = Y.one()
        background = Y.one()
        f = src_odl.KullbackLeibler(Y, data, background)
        x = 2 * Y.one()
        out = Y.element()
        
        t = 0        
        for i in range(len(Y)):
            f_prox = f[i].convex_conj.proximal(x[i])
            src.tic()
            f_prox(x[i], out=out[i])
            t += src.toc()
        print('time:{}, average:{}'.format(t, t / len(Y)))
    
    shape = [4200, 65000]
    print('impl:{}, shape:{}'.format(impl, shape))
    Y = odl.uniform_discr([0, 0], shape, shape, impl=impl)
    data = Y.one()
    background = Y.one()
    f = src_odl.KullbackLeibler(Y, data, background)
    x = 2 * Y.one()
    out = Y.element()
    f_prox = f.convex_conj.proximal(x)
    %time f_prox(x, out=out)

impl:cuda, shape:[4200, 65000], nsubsets:1
time:0.607445955276, average:0.607445955276

impl:cuda, shape:[1050, 65000], nsubsets:4
time:0.405075311661, average:0.101268827915

impl:cuda, shape:[262, 65000], nsubsets:16
time:2.36511826515, average:0.147819891572

impl:cuda, shape:[4200, 65000]
CPU times: user 146 ms, sys: 127 µs, total: 146 ms
Wall time: 150 ms

impl:numpy, shape:[4200, 65000], nsubsets:1
time:3.18624901772, average:3.18624901772

impl:numpy, shape:[1050, 65000], nsubsets:4
time:3.12681221962, average:0.781703054905

impl:numpy, shape:[262, 65000], nsubsets:16
time:3.25435972214, average:0.203397482634

impl:numpy, shape:[4200, 65000]
CPU times: user 13.8 s, sys: 1.47 s, total: 15.2 s
Wall time: 3.19 s

Installation issues

Hi, when I try to build, I get the following error:

CMake Error at CMakeLists.txt:28 (add_dependencies):
  add_dependencies called with incorrect number of arguments


CMake Warning (dev) in CMakeLists.txt:
  No cmake_minimum_required command is present.  A line of code such as

    cmake_minimum_required(VERSION 2.8)

  should be added at the top of the file.  The version specified may be lower
  if you wish to support older CMake versions for this project.  For more
  information run "cmake --help-policy CMP0000".
This warning is for project developers.  Use -Wno-dev to suppress it.

Circular dependency with CUDA

We have a circular dependency: odlcuda imports odl, and odl.space.entry_points imports odlcuda. We need to solve this somehow.
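One common way to break such a cycle (a generic sketch, not a proposal for odl's actual plugin API) is to defer the optional import until it is first needed, so neither module imports the other at load time:

```python
import importlib

def load_backend(name):
    """Lazily import an optional backend module; return None when it is absent."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None

# At odl import time nothing from the plugin is touched; only when a 'cuda'
# space is first requested would the backend actually be imported.
backend = load_backend('odlcuda')  # None on machines without odlcuda installed
```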

Add optimized versions of p-norm and p-dist for p != 2

The standard p-norm can be implemented with the help of the CUDA sum and abs ufuncs, but this involves a copy. We need a C++ implementation if we want efficiency.

For p = inf I currently don't see a way of implementing this in Python. We could include that case in the same C++ function, and while we're at it, max() and min() functions would be good to have.
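For reference, the ufunc-based route looks like the following numpy sketch; the temporaries produced by abs and power are the copy the issue refers to, which a fused C++ kernel would avoid:

```python
import numpy as np

def pnorm(x, p):
    # abs and ** each materialize a temporary the size of x -- this is the
    # extra copy that a dedicated C++ implementation would fuse away.
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

def inf_norm(x):
    # The p = inf case reduces to a max of absolute values.
    return np.max(np.abs(x))

x = np.array([3.0, -4.0])
pnorm(x, 2)   # 5.0
inf_norm(x)   # 4.0
```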

Change from boost to pybind11

pybind11 is an alternative to Boost.Python that does not require a compiled component (or the rest of Boost); this would simplify compilation and distribution via PyPI.

Error when trying to install odlcuda

We installed odl and ran its tests with no issues. Then we tried to install odlcuda, but we had a problem when we ran the command "CUDA_ROOT=/usr/local/cuda-9.0 CUDA_COMPUTE=37 conda build ./conda". The error shows "conda_build.exceptions.DependencyNeedsBuildingError: Unsatisfiable dependencies for platform linux-64: {"odl[version='>=0.3.0']"}".
Do you know what the problem is?

Build for multiple CUDA compute versions?

Currently, as far as I can see, you can only specify a single CUDA_COMPUTE value. For conda packages it would be good to build for several values so that the package works on different GPU architectures.
I don't know much about this topic, so I don't know whether this is necessary at all or whether it's fine to just set the minimum required version. The only thing I know is that the packages on the conda channel are built with compute capability 52 and fail on lower versions with "invalid device function" errors.

slow maximum function

I noticed that the maximum function is very slow in odlcuda on the GPU. In fact, it is slower than computing the maximum on the CPU. Please see my example test case below. Any ideas why that is and how to fix it?

import odl

X = odl.rn(300 * 10**6, dtype='float32')
x = 0.5 * X.one()
y = X.one()
%time x.ufuncs.maximum(1, out=y)
%time x.ufuncs.log(out=y)

X = odl.rn(300 * 10**6, dtype='float32', impl='cuda')
x = 0.5 * X.one()
y = X.one()
%time x.ufuncs.maximum(1, out=y)
%time x.ufuncs.log(out=y)

numpy maximum:
CPU times: user 346 ms, sys: 200 µs, total: 346 ms
Wall time: 347 ms

numpy log:
CPU times: user 1.44 s, sys: 0 ns, total: 1.44 s
Wall time: 1.43 s

cuda maximum:
CPU times: user 838 ms, sys: 341 ms, total: 1.18 s
Wall time: 1.18 s

cuda log:
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 91.1 µs

Installation without admin rights

I have been trying to install odlcuda without admin rights. I have changed the CMAKE_INSTALL_PREFIX folder to one where I do have write access; this approach has worked well for installing other software packages. However, odlcuda seems to ignore this path and tries to install into the default path

/usr/local/lib/python2.7/dist-packages/

instead.

I tried two ways to change the installation path, both of which should work according to some forums, but neither does here.

  • cmake -DCMAKE_INSTALL_PREFIX=~/.local/lib/python2.7/site-packages ../
  • make DESTDIR=~/.local/lib/python2.7/site-packages pyinstall

Any ideas of what is going on here?
