
hughperkins / coriander


Build NVIDIA® CUDA™ code for OpenCL™ 1.2 devices

License: Apache License 2.0

Shell 1.08% C++ 33.51% Python 7.05% LLVM 46.95% C 2.42% Cuda 7.08% CMake 1.85% Dockerfile 0.05%
opencl gpu coriander mac radeon intel amd nvidia ubuntu

coriander's Introduction

Coriander

Build applications written in NVIDIA® CUDA™ code for OpenCL™ 1.2 devices.

Concept

  • leave applications in NVIDIA® CUDA™
  • compile into OpenCL 1.2
  • run on any OpenCL 1.2 GPU

How to use

  • Write an NVIDIA® CUDA™ sourcecode file, or find an existing one
  • Let's use cuda_sample.cu
  • Compile, using cocl:
$ cocl_py cuda_sample.cu
   ...
   ... (bunch of compily stuff) ...
   ...

    ./cuda_sample.cu compiled into ./cuda_sample

Run:

$ ./cuda_sample
Using Intel , OpenCL platform: Intel Gen OCL Driver
Using OpenCL device: Intel(R) HD Graphics 5500 BroadWell U-Processor GT2
hostFloats[2] 123
hostFloats[2] 222
hostFloats[2] 444
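
For reference, a minimal sample along the lines of the sketch below (illustrative only, not necessarily the exact cuda_sample.cu shipped in the repository) is enough to exercise the toolchain. It uses a setValue kernel plus a cudaMalloc/cudaMemcpy round-trip, matching the output shown above:

#include <cstdio>

__global__ void setValue(float *data, int idx, float value) {
    if (threadIdx.x == 0) {
        data[idx] = value;
    }
}

int main() {
    const int N = 4;
    float hostFloats[N];
    float *gpuFloats;
    cudaMalloc((void **)&gpuFloats, N * sizeof(float));

    setValue<<<dim3(32, 1, 1), dim3(32, 1, 1)>>>(gpuFloats, 2, 123.0f);
    cudaMemcpy(hostFloats, gpuFloats, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("hostFloats[2] %.0f\n", hostFloats[2]);

    cudaFree(gpuFloats);
    return 0;
}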

Options

What Coriander provides

  • compiler for host-side code, including memory allocation, copy, streams, kernel launches
  • compiler for device-side code, handling templated C++ code, converting it into bog-standard OpenCL 1.2 code
  • cuBLAS API implementations for GEMM, GEMV, SCAL, SAXPY (using Cedric Nugteren's CLBlast)
  • cuDNN API implementations for: convolutions (using the im2col algorithm over Cedric Nugteren's CLBlast), pooling, ReLU, tanh, and sigmoid
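
As an illustration of that host-side surface, a routine like the following hedged sketch (plain CUDA runtime calls, nothing Coriander-specific) is the kind of code the host-side compiler handles: allocation, copies, a stream, and a kernel launch:

__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

void scale_on_gpu(float *host, int n) {
    float *dev;
    cudaStream_t stream;
    cudaStreamCreate(&stream);                      // streams
    cudaMalloc((void **)&dev, n * sizeof(float));   // memory allocation
    cudaMemcpyAsync(dev, host, n * sizeof(float), cudaMemcpyHostToDevice, stream);  // copy
    scale<<<(n + 255) / 256, 256, 0, stream>>>(dev, n, 2.0f);                       // kernel launch
    cudaMemcpyAsync(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFree(dev);
}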

How Coriander works

Kernel compilation proceeds in two steps:

  • the device-side NVIDIA® CUDA™ code is compiled into LLVM IR, using clang
  • the LLVM IR is then rewritten into OpenCL 1.2 sourcecode

Slides are on the IWOCL website, here

Installation

Coriander development is carried out using the following platforms:

  • Ubuntu 16.04, with:
    • NVIDIA K80 GPU and/or NVIDIA K520 GPU (via aws)
  • Mac Book Pro 4th generation (thank you ASAPP :-) ), with:
    • Intel HD Graphics 530
    • Radeon Pro 450
    • Sierra OS

Other systems should work too, ideally. At a minimum, you will need at least one OpenCL-enabled GPU, with appropriate OpenCL drivers installed for that GPU. Both Linux and Mac systems stand a reasonable chance of working.

For installation, please see installation

Plugins

You can install the following plugins:

  • Coriander-clblast: just do cocl_plugins.py install --repo-url https://github.com/hughperkins/coriander-clblast
  • Coriander-dnn: just do cocl_plugins.py install --repo-url https://github.com/hughperkins/coriander-dnn
  • Your plugin here?

How to create a plugin

Add to your own cmake project

  • use cocl_add_executable and cocl_add_library
  • see cmake usage

Testing

See testing

Assumptions/relaxations made by Coriander

See assumptions

Roadmap

Libraries

Coriander uses the following libraries:

  • clang/llvm: c/c++ parser/compiler; many contributors
  • thrust: parallel GPU library, from NVIDIA®
  • yaml-cpp: yaml for c++, by Jesse Beder
  • EasyCL: wrapper for OpenCL 1.2 boilerplate
  • argparsecpp: command-line parser for c++
  • gtest: unit tests for c++, from Google

Related projects

How to Cite

Please cite: CUDA-on-CL: a compiler and runtime for running NVIDIA® CUDA™ C++11 applications on OpenCL™ 1.2 Devices

License

Apache 2.0

News

  • June 23:
    • factorized the CLBlast implementation of the NVIDIA® CUDA™ cuBLAS API into a new plugin, coriander-clblast
  • June 21:
    • created a new release v6.0.0, that marks a bunch of changes:
      • incorporates of course the earlier changes:
        • took some big steps towards portability and Windows compilation, i.e. using Python 2.7 scripts rather than bash scripts, and fixing many Windows-related compilation issues
        • the plugin architecture
        • factorizing the partial NVIDIA® CUDA™ cuDNN API implementation into a new plugin coriander-dnn
      • moved the default installation directory from /usr/local to ~/coriander
        • this means that plugins can be installed without sudo
        • it also makes it relatively easy to wipe and reinstall, for more effective jenkins testing
      • install_distro.py is now considerably more tested than a few days ago, and handles downloading llvm-4.0 automatically
  • Older news

coriander's People

Contributors

dokipen3d, emogenet, guoyejun, hughperkins, iame6162013, indianajohn, kolanich, weimzh


coriander's Issues

runtime_error

Got this error running on an AMD Radeon Pro 555:

_Z5rsqrtd is called, but not defined
This is probalby a bug in Coriander. Please file an issue at https://github.com/hughperkins/coriander/issues/new
basicblockdumper.runGeneration got exception whilst processing:
  %97 = tail call double @_Z5rsqrtd(double %96) #5

generateOpenCL failed to generate opencl sourcecode
kernel name orig=_ZN5Eigen8internal15EigenMetaKernelINS_15TensorEvaluatorIKNS_14TensorAssignOpINS_9TensorMapINS_6TensorIfLi1ELi1EiEELi16ENS_11MakePointerEEEKNS_19TensorCwiseBinaryOpINS0_20scalar_difference_opIffEEKS8_KNS9_INS0_17scalar_product_opIffEEKNS9_INSD_IKfSF_EEKNS_20TensorBroadcastingOpIKNS_5arrayIiLm1EEEKNS_17TensorReshapingOpIKNS_5SizesIJLl1EEEEKNS4_INS_15TensorFixedSizeISF_NSM_IJEEELi1EiEELi16ES7_EEEEEEKNS4_INS5_ISF_Li1ELi1EiEELi16ES7_EEEEKNS_18TensorCwiseUnaryOpINS0_15scalar_rsqrt_opIfEESC_EEEEEEEENS_9GpuDeviceEEEiEEvT_T0_
kernel name short=_ZN5Eigen8internal15
kernel name unique=_ZN5Eigen8internal15EigenMetaKernelINS_15TensorEvaluatorIKNS_14TensorAssignOpINS_9TensorMapINS_6TensorIfLi1ELi1EiEELi16ENS_11MakePointerEEEKNS_19TensorCwiseBinaryOpINS0_20scalar_difference_opIffEEKS8_KNS9_INS0_17scalar_product_opIffEEKNS9_INSD_IKfSF_EEKNS_20TensorBroadcastingOpIKNS_5arrayIiLm1EEEKNS_17TensorReshapingOpIKNS_5SizesIJLl1EEEEKNS4_INS_15TensorFixedSizeISF_NSM_IJEEELi1EiEELi16ES7_EEEEEEKNS4_INS5_ISF_Li1ELi1EiEELi16ES7_EEEEKNS_18TensorCwiseUnaryOpINS0_15scalar_rsqrt_opIfEESC_EEEEEEEENS_9GpuDeviceEEEiEEvT_T0__1_2_2_2_2_2
writing ll to /tmp/failed-kernel.ll
caught runtime error _Z5rsqrtd is called, but not defined => cannot continue.  Sorry :-(
libc++abi.dylib: terminating with uncaught exception of type std::runtime_error
Abort trap: 6
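
A minimal reproducer is presumably along these lines (a hedged sketch; the actual failing code is Eigen's rsqrt functor), since rsqrt taking a double mangles to _Z5rsqrtd:

__global__ void rsqrt_kernel(double *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // rsqrt on a double should trip the same "_Z5rsqrtd is called, but not defined" error
        data[i] = rsqrt(data[i]);
    }
}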

On Mac/Radeon, passing multiple buffers from same cudaMalloc into kernel launch causes -4 out of memory

On Mac/Radeon, passing multiple buffers from the same cudaMalloc into a kernel launch causes a -4 out-of-memory error.

I'm not sure to what extent passing multiple duplicate buffers into an OpenCL kernel launch is or isn't legal, but on Mac/Radeon drivers it looks like doing so isn't really supported.

The fix will be to de-duplicate buffers passed into the OpenCL kernels, which I'll probably look at doing soon-ish.
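
The pattern that triggers this is roughly the following (a hedged sketch): several kernel arguments carved out of a single cudaMalloc allocation, which all map onto the same underlying OpenCL buffer plus an offset:

__global__ void combine(float *a, float *b, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] + b[i];
}

void launch(int n) {
    float *base = 0;
    cudaMalloc((void **)&base, 3 * n * sizeof(float));
    float *a = base;            // all three arguments come from the same cudaMalloc,
    float *b = base + n;        // so the same cl_mem is passed to the kernel three
    float *out = base + 2 * n;  // times, which Mac/Radeon drivers appear to reject with -4
    combine<<<(n + 255) / 256, 256>>>(a, b, out, n);
    cudaFree(base);
}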

[arch] [beignet] CommandLine Error: Option 'help-list' registered more than once!

The following error comes up when running make run-tests.

I've not tested this on Ubuntu, and my RX 470 doesn't show this issue.

[       OK ] test_kernel_dumper.test_randomintarray (1 ms)
[----------] 6 tests from test_kernel_dumper (2 ms total)

[----------] 1 test from test_hostside_opencl_funcs
[ RUN      ] test_hostside_opencl_funcs.test_create_cl_kernel
: CommandLine Error: Option 'help-list' registered more than once!
LLVM ERROR: inconsistency in registered CommandLine options
make[3]: *** [test/gtest/CMakeFiles/run-gtest-tests.dir/build.make:61: test/gtest/CMakeFiles/run-gtest-tests] Error 1
make[2]: *** [CMakeFiles/Makefile2:423: test/gtest/CMakeFiles/run-gtest-tests.dir/all] Error 2
make[1]: *** [CMakeFiles/Makefile2:77: CMakeFiles/run-tests.dir/rule] Error 2
make: *** [Makefile:164: run-tests] Error 2

CMake error [AMD A10]

CMake Error at CMakeLists.txt:205 (add_library):
  Cannot find source file:

    src/CLBlast/src/database/database.cpp

  Tried extensions .c .C .c++ .cc .cpp .cxx .m .M .mm .h .hh .h++ .hm .hpp
  .hxx .in .txx

CMake Error: CMake can not determine linker language for target: clblast
CMake Error: Cannot determine link language for target "clblast".
CMake Error: CMake can not determine linker language for target: clblast

It would also be nice if the third-party dependencies were:
a) either downloaded when cloning the repo, or
b) declared via find_package(xxx), so that users know that these additional packages are needed.

(edited by Hugh for formatting)

Enable dynamic memory allocation inside GPU kernels

Hi Hugh,

I ran into some issues when attempting to test a CUDA project that was successfully compiled with Coriander, and I encountered this error message:

_Znwm is called, but not defined
basicblockdumper.runGeneration got exception whilst processing:  %6 = tail call i8* @_Znwm(i64 18112) #11

After some digging, I realized that this was related to the new operator and removed it. I then ran into this error:

_ZdaPv is called, but not defined
basicblockdumper.runGeneration got exception whilst processing:  tail call void @_ZdaPv(i8 addrspace(1)* %20) #11

I then realized that it was related to the delete operator. Is there any existing support for these operators, or are there any plans to support them?
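
_Znwm and _ZdaPv are the Itanium-mangled names of operator new and operator delete[], so the failing device code presumably allocates on the device heap, roughly along the lines of the hedged sketch below. OpenCL 1.2 has no in-kernel heap, so as far as I know the usual workaround is to allocate the scratch buffer with cudaMalloc on the host and pass it into the kernel:

// triggers the error: device-side heap allocation has no OpenCL 1.2 equivalent
__global__ void with_heap(float *out, int n) {
    float *scratch = new float[n];
    for (int i = 0; i < n; ++i) scratch[i] = (float)i;
    if (threadIdx.x < n) out[threadIdx.x] = scratch[threadIdx.x];
    delete[] scratch;
}

// workaround sketch: scratch is cudaMalloc'd by the host and passed in
__global__ void with_preallocated(float *out, float *scratch, int n) {
    for (int i = 0; i < n; ++i) scratch[i] = (float)i;
    if (threadIdx.x < n) out[threadIdx.x] = scratch[threadIdx.x];
}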

cuSolver support

AFAIK there isn't any support in Coriander for cuSolver; are there any plans to support it?

I assume this would be done through a plugin like Coriander-dnn; is this a difficult process?

py.test output

Ended with segfault
http://pastebin.com/AFayXygn

Also, there are 3 .cl source files still in /tmp/, but I'm not sure whether they are related to the failed test or not.
/tmp/test-device.cl
/tmp/testprog-device.cl
/tmp/test_expression_inlining-device.cl

can't use math function "floor" in a kernel

When using the function "floor" from math.h in a kernel, Coriander compiles and links the CUDA code fine, but at run time the linked binary produces the following error:

OpenCL platform: Intel Gen OCL Driver
OpenCL device: Intel(R) HD Graphics Kabylake ULT GT2
llvm.floor.f32 is called, but not defined
This is probalby a bug in Coriander. Please file an issue at https://github.com/hughperkins/coriander/issues/new
basicblockdumper.runGeneration got exception whilst processing:
%16 = tail call float @llvm.floor.f32(float %15) #6

generateOpenCL failed to generate opencl sourcecode
kernel name orig=_Z6kernelPf
kernel name short=_Z6kernelPf
kernel name unique=_Z6kernelPf_1
writing ll to /tmp/failed-kernel.ll
caught runtime error llvm.floor.f32 is called, but not defined => cannot continue. Sorry :-(
terminate called after throwing an instance of 'std::runtime_error'
what(): llvm.floor.f32 is called, but not defined => cannot continue. Sorry :-(
Aborted (core dumped)

Full CUDA code is in attachment.

testCode.zip
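
The mangled name in the log, _Z6kernelPf, corresponds to kernel(float*), so a minimal reproducer is presumably along these lines (a hedged sketch, not the attached testCode.zip):

#include <math.h>

__global__ void kernel(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // floor on a float lowers to the llvm.floor.f32 intrinsic, which Coriander
    // does not appear to map onto the OpenCL floor() builtin yet
    data[i] = floor(data[i]);
}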

patch_hostside aborts on windows

Greetings,

I know the Windows platform isn't entirely supported, but I figured I'd see if you could point me in the right direction for fixing my issue. I was able to compile Coriander successfully (with one code tweak to fix a compilation error) and now I'm trying to get cocl_py cuda_sample.cu to run.

I've had to do a few things to get it to run:

  1. I had to pass in the cocl-include flag due to some weird issues with the cocl-include prefix, but I think that's because I built from master instead of win-build, because I thought win-build had been merged in.

  2. cocl_py passes -std=c++11 flags to clang; this causes issues when compiling, since the C++ sources from Visual Studio are C++14 sources, meaning some of the functions in the algorithm headers are now constexpr. I was able to fix these issues by removing the flags.

With these few changes, it starts running, and makes it most of the way through. However, when patch_hostside runs, it exits with a Microsoft Visual C++ Runtime Library Debug error saying abort() had been called. The related output in the log is below:

/coriander/bin\patch_hostside --hostrawfile cuda_sample-hostraw.ll --devicellfile cuda_sample-device.ll --hostpatchedfile cuda_sample-hostpatched.ll
While deleting: void (float*, i32, float)* %�?setValue@@YAXPEAMHM@Z
Use still stuck around after Def is destroyed:  <badref> = bitcast void (float*, i32, float)* @"\01?setValue@@YAXPEAMHM@Z" to i8*
Assertion failed: use_empty() && "Uses remain when a value is destroyed!", file E:\Users\welna\Downloads\llvm-4.0.0.src.tar\llvm-4.0.0.src\lib\IR\Value.cpp, line 85
Traceback (most recent call last):
  File "/coriander/bin/cocl.py", line 384, in <module>
    '--hostpatchedfile', '%s-hostpatched.ll' % OUTPUTBASEPATH
  File "/coriander/bin/cocl.py", line 281, in run
    print(check_output(cmdline_list))
  File "/coriander/bin/cocl.py", line 48, in check_output
    res = subprocess.check_output(cmd_list)
  File "E:\Users\welna\AppData\Local\Programs\Python\Python36\lib\subprocess.py", line 336, in check_output
    **kwargs).stdout
  File "E:\Users\welna\AppData\Local\Programs\Python\Python36\lib\subprocess.py", line 418, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/coriander/bin\\patch_hostside', '--hostrawfile', 'cuda_sample-hostraw.ll', '--devicellfile', 'cuda_sample-device.ll', '--hostpatchedfile', 'cuda_sample-hostpatched.ll']' returned non-zero exit status 3.

Any ideas what the issue could be? If you can point it out, I could probably fiddle around with it and fix it myself, but I have no idea where to begin after googling similar issues since I'm not familiar with the patch_hostside code base.

ModuleNotFoundError: No module named 'cocl_env'

Running
python cocl_plugins.py install --repo-url https://github.com/hughperkins/coriander-dnn

Results in

Traceback (most recent call last):
  File "cocl_plugins.py", line 11, in <module>
    from cocl_env import CLANG_HOME, COCL_LIB, COCL_INCLUDE, COCL_INSTALL_PREFIX
ModuleNotFoundError: No module named 'cocl_env'

Occurs on both Ubuntu and Arch.

Unittests test_clone and test_create_cl_kernel fail

Hello,
I ran the unit tests and 2 of them fail:

[==========] 79 tests from 13 test cases ran. (570 ms total)
[  PASSED  ] 77 tests.
[  FAILED  ] 2 tests, listed below:
[  FAILED  ] test_struct_cloner.test_clone
[  FAILED  ] test_hostside_opencl_funcs.test_create_cl_kernel

test_struct_cloner.test_clone

[ RUN      ] test_struct_cloner.test_clone
structDefinitions struct mystruct {
    int f0;
    float f1;
    global float* f2;
    int f3;
    global float* f4;
};

structDefinition foo {
    int f0;
    float f1;
    global float* f2;
    int f3;
    global float* f4;
};

structNoPtrCl foo {
    int f0;
    float f1;
    int f2;
};

clCopyCode dest.f0 = src.f0;
dest.f1 = src.f1;
dest.f2 = 0;
dest.f3 = src.f2;
dest.f4 = 0;

testIR [; ModuleID = 'hostsideM'
source_filename = "hostsideM"

%"struct mystruct" = type { i32, float, float*, i32, float* }
%"struct mystruct_nopointers" = type { i32, float, i32 }

define void @testfunc() {
entry:
  %0 = alloca %"struct mystruct"
  %1 = alloca %"struct mystruct_nopointers"
  %2 = getelementptr inbounds %"struct mystruct", %"struct mystruct"* %0, i32 0, i32 0
  %3 = getelementptr inbounds %"struct mystruct_nopointers", %"struct mystruct_nopointers"* %1, i32 0, i32 0
  %loadint = load i32, i32* %2
  store volatile i32 %loadint, i32* %3
  %4 = getelementptr inbounds %"struct mystruct", %"struct mystruct"* %0, i32 0, i32 1
  %5 = getelementptr inbounds %"struct mystruct_nopointers", %"struct mystruct_nopointers"* %1, i32 0, i32 1
  %loadint1 = load float, float* %4
  store volatile float %loadint1, float* %5
  %6 = getelementptr inbounds %"struct mystruct", %"struct mystruct"* %0, i32 0, i32 3
  %7 = getelementptr inbounds %"struct mystruct_nopointers", %"struct mystruct_nopointers"* %1, i32 0, i32 2
  %loadint2 = load i32, i32* %6
  store volatile i32 %loadint2, i32* %7
  ret void
}
]
/home/thomas/src/cuda-on-cl/test/gtest/test_struct_cloner.cpp:159: Failure
Value of: testIR
  Actual: "; ModuleID = 'hostsideM'\nsource_filename = \"hostsideM\"\n\n%\"struct mystruct\" = type { i32, float, float*, i32, float* }\n%\"struct mystruct_nopointers\" = type { i32, float, i32 }\n\ndefine void @testfunc() {\nentry:\n  %0 = alloca %\"struct mystruct\"\n  %1 = alloca %\"struct mystruct_nopointers\"\n  %2 = getelementptr inbounds %\"struct mystruct\", %\"struct mystruct\"* %0, i32 0, i32 0\n  %3 = getelementptr inbounds %\"struct mystruct_nopointers\", %\"struct mystruct_nopointers\"* %1, i32 0, i32 0\n  %loadint = load i32, i32* %2\n  store volatile i32 %loadint, i32* %3\n  %4 = getelementptr inbounds %\"struct mystruct\", %\"struct mystruct\"* %0, i32 0, i32 1\n  %5 = getelementptr inbounds %\"struct mystruct_nopointers\", %\"struct mystruct_nopointers\"* %1, i32 0, i32 1\n  %loadint1 = load float, float* %4\n  store volatile float %loadint1, float* %5\n  %6 = getelementptr inbounds %\"struct mystruct\", %\"struct mystruct\"* %0, i32 0, i32 3\n  %7 = getelementptr inbounds %\"struct mystruct_nopointers\", %\"struct mystruct_nopointers\"* %1, i32 0, i32 2\n  %loadint2 = load i32, i32* %6\n  store volatile i32 %loadint2, i32* %7\n  ret void\n}\n"
Expected: expectedIR
Which is: "; ModuleID = 'hostsideM'\n\n%\"struct mystruct\" = type { i32, float, float*, i32, float* }\n%\"struct mystruct_nopointers\" = type { i32, float, i32 }\n\ndefine void @testfunc() {\nentry:\n  %0 = alloca %\"struct mystruct\"\n  %1 = alloca %\"struct mystruct_nopointers\"\n  %2 = getelementptr inbounds %\"struct mystruct\", %\"struct mystruct\"* %0, i32 0, i32 0\n  %3 = getelementptr inbounds %\"struct mystruct_nopointers\", %\"struct mystruct_nopointers\"* %1, i32 0, i32 0\n  %loadint = load i32, i32* %2\n  store volatile i32 %loadint, i32* %3\n  %4 = getelementptr inbounds %\"struct mystruct\", %\"struct mystruct\"* %0, i32 0, i32 1\n  %5 = getelementptr inbounds %\"struct mystruct_nopointers\", %\"struct mystruct_nopointers\"* %1, i32 0, i32 1\n  %loadint1 = load float, float* %4\n  store volatile float %loadint1, float* %5\n  %6 = getelementptr inbounds %\"struct mystruct\", %\"struct mystruct\"* %0, i32 0, i32 3\n  %7 = getelementptr inbounds %\"struct mystruct_nopointers\", %\"struct mystruct_nopointers\"* %1, i32 0, i32 2\n  %loadint2 = load i32, i32* %6\n  store volatile i32 %loadint2, i32* %7\n  ret void\n}\n"
[  FAILED  ] test_struct_cloner.test_clone (1 ms)

test_hostside_opencl_funcs.test_create_cl_kernel

[ RUN      ] test_hostside_opencl_funcs.test_create_cl_kernel
Using Advanced Micro Devices, Inc. , OpenCL platform: AMD Accelerated Parallel Processing
Using OpenCL device: Ellesmere
building kernel myKernel
 ... built
hostdata[0] 0
/home/thomas/src/cuda-on-cl/test/gtest/test_hostside_opencl_funcs.cpp:55: Failure
Value of: hostdata[0]
  Actual: 0
Expected: 123.0f
Which is: 123
[  FAILED  ] test_hostside_opencl_funcs.test_create_cl_kernel (85 ms)

System info:

kernel: 4.10.10-gentoo
opencl: amdgpu-pro-opencl-16.60.1.379184 on top of amdgpu open source stack
llvm: 3.9.1
hardware: AMD RX480

clinfo output:

Platform #0
  Name:                                  AMD Accelerated Parallel Processing
  Version:                               OpenCL 2.0 AMD-APP (2264.10)

  Device #0
    Name:                                Ellesmere
    Type:                                GPU
    Version:                             OpenCL 1.2 AMD-APP (2264.10)
    Global memory size:                  7 GB 149 MB 396 kB 
    Local memory size:                   32 kB 
    Max work group size:                 256
    Max work group size:                 (256, 256, 256)

  Device #1
    Name:                                Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
    Type:                                CPU
    Version:                             OpenCL 1.2 AMD-APP (2264.10)
    Global memory size:                  7 GB 703 MB 352 kB 
    Local memory size:                   32 kB 
    Max work group size:                 1024
    Max work item sizes:                 (1024, 1024, 1024)

How would you read these results?

Thanks,
Thomas.

Compatibility with TensorFlow XLA/NVIDIA TensorRT?

Hi

This is a general question about deep learning inference acceleration with Coriander. TF XLA is a good idea for inference optimization, but its availability is limited to CUDA. NVIDIA has also released TensorRT as an inference optimizer.
Has anyone tried using Coriander with TF XLA or TensorRT?

porting on aarch64

My board info:
arm64 Debian (jessie), GCC 6.2, LLVM 3.8.

  1. This step succeeds:
    git clone --recursive https://github.com/hughperkins/cuda-on-cl
    cd cuda-on-cl
    make
    sudo make install
  2. This step fails: make run-test-cocl-cuda_sample

g++ -o build/test-cocl-cuda_sample build/test-cocl-cuda_sample.o -g -lcocl -lOpenCL
/usr/bin/ld: build/test-cocl-cuda_sample.o: relocation R_AARCH64_ADR_PREL_PG_HI21 against external C
/usr/bin/ld: build/test-cocl-cuda_sample.o(.text+0xe8): unresolvable R_AARCH64_ADR_PREL_PG_HI21 rel'
/usr/bin/ld: final link failed: Bad value
collect2: error: ld returned 1 exit status
Makefile:128: recipe for target 'build/test-cocl-cuda_sample' failed
make: *** [build/test-cocl-cuda_sample] Error 1

sgemm status code -2048 for test_dnn_conv.simple_gpu_conv_backward_filters [Intel HD530, Mac Sierra]

[ RUN      ] test_dnn_conv.simple_gpu_conv_backward_filters
outH=5 outW=6
forward workspaceSizeBytes=3240
backward filter workspaceSizeBytes=5400
workspaceSizeBytes=5400
CL_GPUOFFSET var detected, changing gpu offset from 0 to 0
Using Apple , OpenCL platform: Apple
Using OpenCL device: Intel(R) HD Graphics 530
gpuMemoryAllocSize=10340
building kernel im2col_kernel
 ... built
sgemm status code -2048
unknown file: Failure
Unknown C++ exception thrown in the test body.

Error code means:

  kKernelLaunchError         = -2048, // Problem occurred when enqueuing the kernel

More Secure Route to Install LLVM 4.0

I was troubled that LLVM:

  • Has no HTTPS on the release downloads
  • Offers no SSL-protected way to access secure hashes or GPG keys for verification

For Ubuntu 16.04 I found that the Rust PPA provides LLVM / Clang 4.0, so this might make a useful alternative for installing the toolchains in a secure way:

sudo add-apt-repository ppa:ubuntu-toolchain-r/rust
sudo apt-get update
sudo apt-get install llvm-4.0 llvm-4.0-dev clang-4.0

where can I see the openCL file?

After compiling with cocl, e.g. cocl_py cuda_sample.cu, I only get a binary executable file "cuda_sample".
Is it possible to see the generated OpenCL source code (.cl file) somewhere?

Code won't compile on High Sierra

Hi!
I've been having issues with Coriander on macOS 10.13. I get this error whenever I try to compile the sample included with this repo:

cocl args: cs.cu
LLVM_COMPILE_FLAGS -I/usr/local/opt/clang+llvm-4.0.0-x86_64-apple-darwin/include -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/usr/local/opt/clang+llvm-4.0.0-x86_64-apple-darwin/include -fPIC -fvisibility-inlines-hidden -Wall -W -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wcovered-switch-default -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wstring-conversion -Werror=date-time -std=c++11 -fexceptions -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS
+ /usr/local/opt/llvm-4.0/bin/clang++ -DUSE_CLEW -std=c++11 -x cuda -D__CORIANDERCC__ -D__CUDACC__ --cuda-gpu-arch=sm_30 -nocudalib -nocudainc --cuda-device-only -emit-llvm -O2 -S -stdlib=libc++ -Wno-gnu-anonymous-struct -Wno-nested-anon-types -I/usr/local/opt/clang+llvm-4.0.0-x86_64-apple-darwin/include -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/usr/local/opt/clang+llvm-4.0.0-x86_64-apple-darwin/include -fPIC -fvisibility-inlines-hidden -Wall -W -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wcovered-switch-default -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wstring-conversion -Werror=date-time -std=c++11 -fexceptions -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/usr/local/include/EasyCL -I/usr/local/include/cocl -include /usr/local/include/cocl/cocl.h -include /usr/local/include/cocl/fake_funcs.h -include /usr/local/include/cocl/cocl_deviceside.h -I/usr/local/include ./cs.cu -o ./cs-device-noopt.ll
In file included from <built-in>:1:
In file included from /usr/local/include/cocl/cocl.h:10:
In file included from /usr/local/opt/clang+llvm-4.0.0-x86_64-apple-darwin/bin/../include/c++/v1/stdexcept:46:
In file included from /usr/local/opt/clang+llvm-4.0.0-x86_64-apple-darwin/bin/../include/c++/v1/exception:82:
In file included from /usr/local/opt/clang+llvm-4.0.0-x86_64-apple-darwin/bin/../include/c++/v1/cstdlib:86:
/usr/local/opt/clang+llvm-4.0.0-x86_64-apple-darwin/bin/../include/c++/v1/stdlib.h:94:15: fatal error: 'stdlib.h' file not found
#include_next <stdlib.h>
              ^~~~~~~~~~
1 error generated.

Any help would be greatly appreciated.
P.S. Running this on a new 15" MacBook Pro, in case it helps :)

Diagnose and/or fix the eigen test failures

Coriander can now be run from within the Eigen test framework, on branch eigen-cl, https://bitbucket.org/eigen/eigen/pull-requests/240/opencl-12-on-mac/diff . The first group of tests runs successfully :-), i.e. you can do:

hg clone https://[email protected]/hughperkins/eigen
cd eigen
mkdir build
cd build
ccmake ..
# press 'c'
# set EIGEN_TEST_CORIANDER to ON
# press 'c' a couple of times, then 'g'
make cxx11_tensor_cuda_1
unsupported/test/cxx11_tensor_cuda_1

However, it would be nice if some more of the tests passed :-) . So there is an opportunity for one or more people to look into why the other tests are failing, and either raise an issue with the root cause(s), or fix said root causes :-)

[arch] testshfl.cu:50 Assertion `hostFloats1[0] == 1001' failed

make run-testshfl
OpenCL platform: AMD Accelerated Parallel Processing
OpenCL device: Ellesmere
1002
1003
1004
1006
1007
testshfl: /coriander/test/endtoend/testshfl.cu:50: int main(int, char **): Assertion `hostFloats1[0] == 1001' failed.
make[3]: *** [test/endtoend/CMakeFiles/run-testshfl.dir/build.make:61: test/endtoend/CMakeFiles/run-testshfl] Aborted (core dumped)
make[2]: *** [CMakeFiles/Makefile2:2785: test/endtoend/CMakeFiles/run-testshfl.dir/all] Error 2
make[1]: *** [CMakeFiles/Makefile2:627: test/endtoend/CMakeFiles/run-endtoend-tests.dir/rule] Error 2
make: *** [Makefile:307: run-endtoend-tests] Error 2

Just a note for myself

test failed on master 4f051a4 on iMac

[ RUN ] test_hostside_opencl_funcs.test_create_cl_kernel
Using Apple , OpenCL platform: Apple
Using OpenCL device: Iris Pro
building kernel myKernel
... built
hostdata[0] 0
/Volumes/Data/work/DeepLearning/tensorflow/tensorflow-cl/third_party/cuda-on-cl/test/gtest/test_hostside_opencl_funcs.cpp:55: Failure
Value of: hostdata[0]
Actual: 0
Expected: 123.0f
Which is: 123
[ FAILED ] test_hostside_opencl_funcs.test_create_cl_kernel (1087 ms)
[----------] 1 test from test_hostside_opencl_funcs (1087 ms total)

[----------] Global test environment tear-down
[==========] 57 tests from 9 test cases ran. (1139 ms total)
[ PASSED ] 56 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] test_hostside_opencl_funcs.test_create_cl_kernel

tf `random_op_gpu.cc` fails

tf random_op_gpu.cc fails.

Example output https://gist.github.com/hughperkins/a41080ce90bca976ba37d2e97cc868c4

There are a few issues:

  • detection of by-value structs is unreliable (fixed in eade62d)
  • parentheses not added around == expressions (fixed in 5e167c8 )
  • float constants that trigger scientific notation (eg 1e-7) are being displayed incorrectly (eg 1e-7.0f) . (fixed in 38dc48b )
  • (New!) _Z3logf not implemented, see below . (fixed in d000b5f)
  • (New!) readnone kernel parameters cause calling inconsistencies (fixed in eae83fd )
  • (New!) Need to implement umulhi, currently stubbed...
    } else if(functionName == "_Z8__umulhiii") {
    localValueInfo->setAddressSpace(0);
    localValueInfo->setExpression("0");
    return;
    . (Implemented in 1d9e0e8 )
  • (New!) Missing _Z6memcpyPvPKvm implementation, on device-side . (implemented in c24f828 )

To reproduce the issue:

Relevant bytecode:

Example generated OpenCL:

Example of _Z3logf not implemented issue. Output during running the test_misc.py script:

__internal__ build log: 
<program source>:440:10: error: assigning to 'float' from incompatible type 'void'
    v245 = _Z3logf((v224 < 1e-07f) ? 1e-07f : v224, scratch);
         ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<program source>:476:10: error: assigning to 'float' from incompatible type 'void'
    v277 = _Z3logf((v259 < 1e-07f) ? 1e-07f : v259, scratch);
         ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

_Z6memcpyPvPKvm: in the OpenCL, we can see:

_Z6memcpyPvPKvm(v65, v64, 4, scratch);

and

void _Z6memcpyPvPKvm(char* v1, char* v2, long v3, local int *scratch) {
}

... so the declaration of _Z6memcpyPvPKvm is being treated as an implementation, and then called. We should at least replace this implementation with something workable. It might be ideal if missing implementations raised an exception, perhaps.

../get-llvm-cxxflags.sh: line 9: /usr/local/opt/llvm-4.0/bin/llvm-config: No such file or directory

I'm reporting this from Arch Linux.

CLANG_HOME is used in the erroring scripts.
I think the errors are due to CLANG_HOME pointing to "/usr/local/opt/llvm-4.0", even though since v6.0.0 the default installation directory has moved from /usr/local to ~/coriander.

-- The C compiler identification is GNU 7.1.1
-- The CXX compiler identification is GNU 7.1.1
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
/coriander/cmake/get-llvm-cxxflags.sh: line 9: /usr/local/opt/llvm-4.0/bin/llvm-config: No such file or directory
/coriander/cmake/llvm-syslibs.sh: line 4: /usr/local/opt/llvm-4.0/bin/llvm-config: No such file or directory
/coriander/cmake/get-llvm-libs.sh: line 6: /usr/local/opt/llvm-4.0/bin/llvm-config: No such file or directory
-- Configuring done
-- Generating done
-- Build files have been written to: /coriander/build

CUDA code won't run

I tried compiling the following CUDA code with cocl cudatest.cu on a macOS Sierra machine with a GeForce GTX 1070:

#include <stdio.h>

#define SIZE 1000

__global__ void kernel_matrix_add(float *input1, float *input2, float *output) {
        const unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
        output[idx] = input1[idx] + input2[idx];
}

__global__ void kernel_matrix_multiply(float *input1, float *input2, float *output) {
        const unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
        output[idx] = input1[idx] * input2[idx];
}

__global__ void kernel_matrix_divide(float *input1, float *input2, float *output) {
        const unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
        output[idx] = input1[idx] / input2[idx];
}

__global__ void kernel_matrix_subtract(float *input1, float *input2, float *output) {
        const unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
        output[idx] = input1[idx] - input2[idx];
}


int main() {

        float * in1;
        float * in2;
        float * out;

        cudaHostAlloc(&in1, SIZE*sizeof(float), cudaHostAllocDefault);
        cudaHostAlloc(&in2, SIZE*sizeof(float), cudaHostAllocDefault);
        cudaHostAlloc(&out, SIZE*sizeof(float), cudaHostAllocDefault);

        for (int i = 0; i < SIZE; ++i) {
                in1[i] = i;
                in2[i] = i;
                out[i] = 0;
        }

        float * d_in1;
        float * d_in2;
        float * d_out;

        cudaMalloc(&d_in1, SIZE*sizeof(float));
        cudaMalloc(&d_in2, SIZE*sizeof(float));
        cudaMalloc(&d_out, SIZE*sizeof(float));

        cudaMemcpy(d_in1, in1, SIZE*sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_in2, in2, SIZE*sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_out, out, SIZE*sizeof(float), cudaMemcpyHostToDevice);

        kernel_matrix_multiply<<<SIZE / 1024 + 1, 1024>>>(d_in1, d_in2, d_out);

        cudaMemcpy(out, d_out, SIZE*sizeof(float), cudaMemcpyDeviceToHost);

        printf("First 10 Results:\n");

        for (int i = 0; i < 10; ++i) {
                printf("%f\n", out[i]);
        }

}

But I got the following error:

cocl args: cudatest.cu
LLVM_COMPILE_FLAGS -I/usr/local/opt/clang+llvm-4.0.0-x86_64-apple-darwin/include -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/usr/local/opt/clang+llvm-4.0.0-x86_64-apple-darwin/include -fPIC -fvisibility-inlines-hidden -Wall -W -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wcovered-switch-default -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wstring-conversion -Werror=date-time -std=c++11 -fexceptions -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS
+ /usr/local/opt/llvm-4.0/bin/clang++ -DUSE_CLEW -std=c++11 -x cuda -D__CORIANDERCC__ -D__CUDACC__ --cuda-gpu-arch=sm_30 -nocudalib -nocudainc --cuda-device-only -emit-llvm -O2 -S -stdlib=libc++ -Wno-gnu-anonymous-struct -Wno-nested-anon-types -I/usr/local/opt/clang+llvm-4.0.0-x86_64-apple-darwin/include -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/usr/local/opt/clang+llvm-4.0.0-x86_64-apple-darwin/include -fPIC -fvisibility-inlines-hidden -Wall -W -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wcovered-switch-default -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wstring-conversion -Werror=date-time -std=c++11 -fexceptions -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/usr/local/include/EasyCL -I/usr/local/include/cocl -include /usr/local/include/cocl/cocl.h -include /usr/local/include/cocl/fake_funcs.h -include /usr/local/include/cocl/cocl_deviceside.h -I/usr/local/include ./cudatest.cu -o ./cudatest-device-noopt.ll
./cudatest.cu:41:42: error: use of undeclared identifier 'cudaHostAllocDefault'
        cudaHostAlloc(&in1, SIZE*sizeof(float), cudaHostAllocDefault);
                                                ^
./cudatest.cu:42:42: error: use of undeclared identifier 'cudaHostAllocDefault'
        cudaHostAlloc(&in2, SIZE*sizeof(float), cudaHostAllocDefault);
                                                ^
./cudatest.cu:43:42: error: use of undeclared identifier 'cudaHostAllocDefault'
        cudaHostAlloc(&out, SIZE*sizeof(float), cudaHostAllocDefault);
                                                ^
3 errors generated.

Is there a limitation to coriander I'm not aware of? How can I make this compatible?

Thanks!
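
cudaHostAllocDefault (and cudaHostAlloc itself, as far as I can tell) is simply not declared by Coriander's headers. A hedged workaround, assuming pinned host memory is not actually needed here, is to use plain host allocations and keep the device side unchanged, for example:

#include <cstdio>
#include <cstdlib>

#define SIZE 1000

__global__ void kernel_matrix_multiply(float *input1, float *input2, float *output) {
        const unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < SIZE) output[idx] = input1[idx] * input2[idx];
}

int main() {
        // plain malloc instead of cudaHostAlloc(..., cudaHostAllocDefault)
        float *in1 = (float *)malloc(SIZE * sizeof(float));
        float *in2 = (float *)malloc(SIZE * sizeof(float));
        float *out = (float *)malloc(SIZE * sizeof(float));
        for (int i = 0; i < SIZE; ++i) { in1[i] = i; in2[i] = i; out[i] = 0; }

        float *d_in1, *d_in2, *d_out;
        cudaMalloc(&d_in1, SIZE * sizeof(float));
        cudaMalloc(&d_in2, SIZE * sizeof(float));
        cudaMalloc(&d_out, SIZE * sizeof(float));
        cudaMemcpy(d_in1, in1, SIZE * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_in2, in2, SIZE * sizeof(float), cudaMemcpyHostToDevice);

        // 256 threads per block stays within most OpenCL devices' max work-group size
        kernel_matrix_multiply<<<(SIZE + 255) / 256, 256>>>(d_in1, d_in2, d_out);

        // copy the results back device-to-host
        cudaMemcpy(out, d_out, SIZE * sizeof(float), cudaMemcpyDeviceToHost);
        for (int i = 0; i < 10; ++i) printf("%f\n", out[i]);

        free(in1); free(in2); free(out);
        return 0;
}

Pinned-memory semantics are lost, but for this sample that should not matter.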

`global **` kernel parameters not supported

See hughperkins/tf-coriander#33. Currently, global ** parameters in by-value structs are not handled by Coriander. I think they should be.

Relevant bytecode: files prefixed split_lib_gpu in https://github.com/hughperkins/coriander/tree/895934eff7bdfe939cb62ea1f9ad7e3a06b3c8d5/test/tf

Example of current generated opencl: https://gist.github.com/hughperkins/becd78a579fd17d0fc7aeb277d5cee68

struct tensorflow__CudaDeviceArrayStruct {
    int f0;
    float* f1[8];
    global float** f2;
};

kernel void _ZN10tensorflow12_GL(global char* clmem0, global char* clmem1, uint v9_offset, int v10, int v11, int v12, uint v13_offset, local int *scratch);

kernel void _ZN10tensorflow12_GL(global char* clmem0, global char* clmem1, uint v9_offset, int v10, int v11, int v12, uint v13_offset, local int *scratch) {
    global struct tensorflow__CudaDeviceArrayStruct* v13 = (global struct tensorflow__CudaDeviceArrayStruct*)(clmem1 + v13_offset);
    global float* v9 = (global float*)(clmem0 + v9_offset);
...

As for how to handle this ... :-P . ???
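
For reference, the CUDA-side pattern is roughly a by-value struct carrying device pointers, along the lines of this hedged sketch (the real tensorflow CudaDeviceArrayStruct is more involved; the names here are illustrative):

struct DeviceArray {
    int count;                 // f0 in the generated struct
    float *inline_values[8];   // f1: small arrays stored inline in the struct
    float **out_of_line;       // f2: the global ** member Coriander cannot handle yet
};

__global__ void gather(DeviceArray arr, float *out) {
    int i = threadIdx.x;
    if (i < arr.count) {
        out[i] = *arr.out_of_line[i];
    }
}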

Add batch normalization

Would be very cool to have batch normalization enabled. There is a short window of opportunity for someone to implement this, whilst I'm working on fixing split, and conv.

cocl works, cocl_py doesn't

I am using Python 2.7.10 on macOS and find that cocl_py does not work in the way that cocl does. Is there something horribly wrong with my Python environment? I do not do anything to it, other than installing Python 2 and 3 from Homebrew, which should not be used here.

Thanks for your help. This is an awesome project.

Error

/opt/llvm/cocl/bin/cocl.py -g -O3 -std=c++11 -D_MWAITXINTRIN_H_INCLUDED -DPRKVERSION=2.16 nstream-cuda.cu -o nstream-cuda
Traceback (most recent call last):
  File "/opt/llvm/cocl/bin/cocl.py", line 114, in <module>
    DSTRIPPED = THISARG.REPLACE('-D-D', '-D')
AttributeError: 'str' object has no attribute 'REPLACE'

Working

/opt/llvm/cocl/bin/cocl -g -O3 -std=c++11 -D_MWAITXINTRIN_H_INCLUDED -DPRKVERSION="2.16" nstream-cuda.cu -o nstream-cuda

WARNING: `cocl` is deprecated, and will be removed in a future release
Please use: `cocl_py`, which is easier to maintain, and portable

cocl args: -g -O3 -std=c++11 -D_MWAITXINTRIN_H_INCLUDED -DPRKVERSION=2.16 nstream-cuda.cu -o nstream-cuda
LLVM_COMPILE_FLAGS -I/opt/llvm/clang+llvm-4.0.0-x86_64-apple-darwin/include -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/opt/llvm/clang+llvm-4.0.0-x86_64-apple-darwin/include -fPIC -fvisibility-inlines-hidden -Wall -W -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wcovered-switch-default -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wstring-conversion -Werror=date-time -std=c++11 -fexceptions -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS
+ /opt/llvm/clang+llvm-4.0.0-x86_64-apple-darwin/bin/clang++ -D_MWAITXINTRIN_H_INCLUDED -DPRKVERSION=2.16 -DUSE_CLEW -std=c++11 -x cuda -D__CORIANDERCC__ -D__CUDACC__ --cuda-gpu-arch=sm_30 -nocudalib -nocudainc --cuda-device-only -emit-llvm -O2 -S -stdlib=libc++ -Wno-gnu-anonymous-struct -Wno-nested-anon-types -I/opt/llvm/clang+llvm-4.0.0-x86_64-apple-darwin/include -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/opt/llvm/clang+llvm-4.0.0-x86_64-apple-darwin/include -fPIC -fvisibility-inlines-hidden -Wall -W -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wcovered-switch-default -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wstring-conversion -Werror=date-time -std=c++11 -fexceptions -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/opt/llvm/cocl/include/EasyCL -I/opt/llvm/cocl/include/cocl -include /opt/llvm/cocl/include/cocl/cocl.h -include /opt/llvm/cocl/include/cocl/fake_funcs.h -include /opt/llvm/cocl/include/cocl/cocl_deviceside.h -I/opt/llvm/cocl/include ./nstream-cuda.cu -o ./nstream-cuda-device-noopt.ll
+ /opt/llvm/clang+llvm-4.0.0-x86_64-apple-darwin/bin/opt -inline -mem2reg -instcombine -S -o ./nstream-cuda-device.ll ./nstream-cuda-device-noopt.ll
+ /opt/llvm/clang+llvm-4.0.0-x86_64-apple-darwin/bin/clang++ -D_MWAITXINTRIN_H_INCLUDED -DPRKVERSION=2.16 -DUSE_CLEW -std=c++11 -x cuda -nocudainc --cuda-host-only -emit-llvm -O2 -S -g -D__CUDACC__ -D__CORIANDERCC__ -Wno-gnu-anonymous-struct -Wno-nested-anon-types -I/opt/llvm/clang+llvm-4.0.0-x86_64-apple-darwin/include -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/opt/llvm/clang+llvm-4.0.0-x86_64-apple-darwin/include -fPIC -fvisibility-inlines-hidden -Wall -W -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wcovered-switch-default -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wstring-conversion -Werror=date-time -std=c++11 -fexceptions -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/opt/llvm/cocl/include -I/opt/llvm/cocl/include/EasyCL -I/opt/llvm/cocl/include/cocl -stdlib=libc++ -I/opt/llvm/clang+llvm-4.0.0-x86_64-apple-darwin/include -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/opt/llvm/clang+llvm-4.0.0-x86_64-apple-darwin/include -fPIC -fvisibility-inlines-hidden -Wall -W -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wcovered-switch-default -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wstring-conversion -Werror=date-time -std=c++11 -fexceptions -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -include /opt/llvm/cocl/include/cocl/cocl.h -include /opt/llvm/cocl/include/cocl/fake_funcs.h -include /opt/llvm/cocl/include/cocl/cocl_hostside.h ./nstream-cuda.cu -o ./nstream-cuda-hostraw.ll
+ /opt/llvm/cocl/bin/patch_hostside --hostrawfile ./nstream-cuda-hostraw.ll --devicellfile ./nstream-cuda-device.ll --hostpatchedfile ./nstream-cuda-hostpatched.ll
+ /opt/llvm/clang+llvm-4.0.0-x86_64-apple-darwin/bin/clang++ -D_MWAITXINTRIN_H_INCLUDED -DPRKVERSION=2.16 -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -fPIC -fvisibility-inlines-hidden -Wall -W -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wcovered-switch-default -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wstring-conversion -Werror=date-time -std=c++11 -fexceptions -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -DUSE_CLEW -O3 -g -c ./nstream-cuda-hostpatched.ll -o ./nstream-cuda.o
+ clang++ -Wl,-rpath,/opt/llvm/cocl/lib -Wl,-rpath,59357ORIGIN -g -o ./nstream-cuda ./nstream-cuda.o -L/opt/llvm/cocl/lib -lcocl -lclblast -leasycl -lclew -lpthread -L/opt/llvm/clang+llvm-4.0.0-x86_64-apple-darwin/lib -Wl,-search_paths_first -Wl,-headerpad_max_install_names -lLLVMLTO -lLLVMPasses -lLLVMObjCARCOpts -lLLVMMIRParser -lLLVMSymbolize -lLLVMDebugInfoPDB -lLLVMDebugInfoDWARF -lLLVMCoverage -lLLVMTableGen -lLLVMOrcJIT -lLLVMXCoreDisassembler -lLLVMXCoreCodeGen -lLLVMXCoreDesc -lLLVMXCoreInfo -lLLVMXCoreAsmPrinter -lLLVMSystemZDisassembler -lLLVMSystemZCodeGen -lLLVMSystemZAsmParser -lLLVMSystemZDesc -lLLVMSystemZInfo -lLLVMSystemZAsmPrinter -lLLVMSparcDisassembler -lLLVMSparcCodeGen -lLLVMSparcAsmParser -lLLVMSparcDesc -lLLVMSparcInfo -lLLVMSparcAsmPrinter -lLLVMRISCVDesc -lLLVMRISCVCodeGen -lLLVMRISCVInfo -lLLVMPowerPCDisassembler -lLLVMPowerPCCodeGen -lLLVMPowerPCAsmParser -lLLVMPowerPCDesc -lLLVMPowerPCInfo -lLLVMPowerPCAsmPrinter -lLLVMNVPTXCodeGen -lLLVMNVPTXDesc -lLLVMNVPTXInfo -lLLVMNVPTXAsmPrinter -lLLVMMSP430CodeGen -lLLVMMSP430Desc -lLLVMMSP430Info -lLLVMMSP430AsmPrinter -lLLVMMipsDisassembler -lLLVMMipsCodeGen -lLLVMMipsAsmParser -lLLVMMipsDesc -lLLVMMipsInfo -lLLVMMipsAsmPrinter -lLLVMLanaiDisassembler -lLLVMLanaiCodeGen -lLLVMLanaiAsmParser -lLLVMLanaiDesc -lLLVMLanaiInstPrinter -lLLVMLanaiInfo -lLLVMHexagonDisassembler -lLLVMHexagonCodeGen -lLLVMHexagonAsmParser -lLLVMHexagonDesc -lLLVMHexagonInfo -lLLVMBPFDisassembler -lLLVMBPFCodeGen -lLLVMBPFDesc -lLLVMBPFInfo -lLLVMBPFAsmPrinter -lLLVMARMDisassembler -lLLVMARMCodeGen -lLLVMARMAsmParser -lLLVMARMDesc -lLLVMARMInfo -lLLVMARMAsmPrinter -lLLVMAMDGPUDisassembler -lLLVMAMDGPUCodeGen -lLLVMAMDGPUAsmParser -lLLVMAMDGPUDesc -lLLVMAMDGPUInfo -lLLVMAMDGPUAsmPrinter -lLLVMAMDGPUUtils -lLLVMAArch64Disassembler -lLLVMAArch64CodeGen -lLLVMAArch64AsmParser -lLLVMAArch64Desc -lLLVMAArch64Info -lLLVMAArch64AsmPrinter -lLLVMAArch64Utils -lLLVMObjectYAML -lLLVMLibDriver -lLLVMOption -lLLVMX86Disassembler -lLLVMX86AsmParser -lLLVMX86CodeGen -lLLVMGlobalISel -lLLVMSelectionDAG -lLLVMAsmPrinter -lLLVMDebugInfoCodeView -lLLVMDebugInfoMSF -lLLVMX86Desc -lLLVMMCDisassembler -lLLVMX86Info -lLLVMX86AsmPrinter -lLLVMX86Utils -lLLVMMCJIT -lLLVMLineEditor -lLLVMInterpreter -lLLVMExecutionEngine -lLLVMRuntimeDyld -lLLVMCodeGen -lLLVMTarget -lLLVMCoroutines -lLLVMipo -lLLVMInstrumentation -lLLVMVectorize -lLLVMScalarOpts -lLLVMLinker -lLLVMIRReader -lLLVMAsmParser -lLLVMInstCombine -lLLVMTransformUtils -lLLVMBitWriter -lLLVMAnalysis -lLLVMObject -lLLVMMCParser -lLLVMMC -lLLVMBitReader -lLLVMProfileData -lLLVMCore -lLLVMSupport -lLLVMDemangle -lcurses -lz -lm

System Info

$ which python
/usr/bin/python
$ python --version
Python 2.7.10
$ uname -prsv
Darwin 16.7.0 Darwin Kernel Version 16.7.0: Mon Nov 13 21:56:25 PST 2017; root:xnu-3789.72.11~1/RELEASE_X86_64 i386

Passing double to kernel by value fails

Passing a double value to a kernel by value like this:

__global__ void foo(double* A, double x)
{
  int  i = threadIdx.x;
  A[i] = x;
}

Results in the following error when compiling with cocl:

error: invalid forward reference to function 'setKernelArgFloat' with wrong type!
  call void @setKernelArgFloat(double %loadCudaArg1)
            ^
1 error generated.
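
A hedged workaround until double-by-value kernel arguments are supported: pass the scalar through a one-element device buffer instead (or downcast to float, if the precision loss is acceptable), e.g.:

__global__ void foo(double *A, const double *x_buf)
{
  int i = threadIdx.x;
  A[i] = x_buf[0];   // read the scalar from a one-element buffer rather than by value
}

void launch_foo(double *d_A, double x, int n)
{
  double *d_x;
  cudaMalloc((void **)&d_x, sizeof(double));
  cudaMemcpy(d_x, &x, sizeof(double), cudaMemcpyHostToDevice);
  foo<<<1, n>>>(d_A, d_x);
  cudaFree(d_x);
}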

Update to llvm 4.0?

LLVM 4.0 was released on 13 March 2017.
Using an outdated LLVM seems counterproductive, especially since it isn't even LLVM 3.9, the last release of its series.

Deep learning framework support with CUDA backend

Since Coriander can support CUDA on OpenCL, are there any success stories of making it work with the CUDA backend of a deep learning framework such as TensorFlow or Caffe on a non-CUDA platform?

[...] Eigen is a pre-requisite for running Eigen.

In: coriander/test/eigen/README.md
We see the following sentence:

Eigen is used by Tensorflow, so being able to build and run Eigen is a pre-requisite for running Eigen.

Surely you mean:

Eigen is used by Tensorflow, so being able to build and run Eigen is a pre-requisite for running Tensorflow.

If you want me to make a PR for it, ask me. :)

bad_alloc when launch cuda kernel, [PowerVR Rogue G6230]

When running cuda_sample or other test applications, I get the error below, which happens at the time of launching the CUDA kernel (i.e. setValue in cuda_sample).

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Stack dump:
0.	<eof> parser at end of file

There is no useful info in the backtrace:
https://gist.github.com/JammyZhou/23782a6764ea2879dd179fff37fb64c0

After enabling COCL_SPAM, the log is below:

OpenCL platform: PowerVR Rogue
OpenCL device: PowerVR Rogue G6230
setKernelArgInt32 2
setKernelArgFloat 123
kernelGo() kernel: _Z8setValuePfif
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Stack dump:
0.	<eof> parser at end of file

Arch support

This is mainly to track what still has to be done to install this on Arch Linux.

  • Arch users don't have the libtinfo5 package installed, yet it is required.
    Installing https://aur.archlinux.org/packages/libtinfo5/ doesn't work.
  • The llvm_version gathered from ~/coriander/soft/llvm-4.0/bin/clang++ --version is incorrect ('information' instead of 4.0.0). lol

~/coriander/soft/llvm-4.0/bin/clang++ --version results in:

/usr/lib/libtinfo.so.5: no version information available (required by /home/iame/coriander/soft/llvm-4.0/bin/clang++)
clang version 4.0.0 (tags/RELEASE_400/final)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/iame/coriander/soft/llvm-4.0/bin

I'll supply you with the required PRs to make it work on Arch Linux :)

Build failure for aarch64

When I tried to build Coriander on my aarch64 board, I ran into the error below; this conditional-branch issue seems common for some other projects on aarch64 as well.

[ 2%] Built target clew
[ 2%] Building CXX object CMakeFiles/clblast.dir/src/CLBlast/src/database/database.cpp.o

/tmp/ccitmCEV.s: Assembler messages:
/tmp/ccitmCEV.s:22989: Error: conditional branch out of range
/tmp/ccitmCEV.s:22996: Error: conditional branch out of range
CMakeFiles/clblast.dir/build.make:54: recipe for target 'CMakeFiles/clblast.dir/src/CLBlast/src/database/database.cpp.o' failed
make[2]: *** [CMakeFiles/clblast.dir/src/CLBlast/src/database/database.cpp.o] Error 1
CMakeFiles/Makefile2:1980: recipe for target 'CMakeFiles/clblast.dir/all' failed
make[1]: *** [CMakeFiles/clblast.dir/all] Error 2
Makefile:117: recipe for target 'all' failed
make: *** [all] Error 2

Permission denied (public key)

I don't know if it was a problem with ssh (the internet I'm using has had problems) or something else, but I can't download the CLBlast submodule using the Python install.
I've tried to manually download your CLBlast into the /src folder, but it doesn't seem to notice it.

I am running Ubuntu 16.04

The exact errors are:

/usr/bin/python /home/alex/coriander/bin/cocl_plugins.py install --repo-url https://github.com/hughperkins/coriander-clblast --git-branch master
Cloning into 'coriander-clblast'...
Cloning into 'src/CLBlast'...
Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
fatal: clone of '[email protected]:hughperkins/CLBlast' into submodule path 'src/CLBlast' failed
('CLANG_HOME', '/home/alex/coriander/soft/llvm-4.0')
('COCL_INSTALL_PREFIX', '/home/alex/coriander')
('COCL_INCLUDE', '/home/alex/coriander/include')
('COCL_LIB', '/home/alex/coriander/lib')
plugin_name [coriander-clblast]

Traceback (most recent call last):
File "/home/alex/coriander/bin/cocl_plugins.py", line 114, in
func(**args_dict)
File "/home/alex/coriander/bin/cocl_plugins.py", line 80, in install
], cwd=git_dir))
File "/home/alex/coriander/bin/cocl_plugins.py", line 26, in check_output
res = subprocess.check_output(cmdlist, cwd=cwd)
File "/usr/lib/python2.7/subprocess.py", line 574, in check_output
raise CalledProcessError(retcode, cmd, output=output)
subprocess.CalledProcessError: Command '['git', 'clone', '--recursive', 'https://github.com/hughperkins/coriander-clblast', '-b', 'master', 'coriander-clblast']' returned non-zero exit status 128
p.returncode 1
Traceback (most recent call last):
File "install_distro.py", line 221, in
main(**args.dict)
File "install_distro.py", line 212, in main
install_plugin(install_dir=install_dir, repo_url=repo_url, git_branch=git_branch)
File "install_distro.py", line 199, in install_plugin
'--git-branch', git_branch
File "install_distro.py", line 88, in run
assert p.returncode == 0
AssertionError

Error on python build script

When running python install_distro.py on macOS Sierra 10.12.5 I get:

cd to [/Users/laptop/coriander]
cd to [/Users/laptop/coriander/soft]
wget --progress=dot:giga http://releases.llvm.org/4.0.0/clang+llvm-4.0.0-x86_64-apple-darwin.tar.xz -O clang+llvm-4.0.0-x86_64-apple-darwin.tar.xz
Traceback (most recent call last):
File "install_distro.py", line 193, in
main(**args.dict)
File "install_distro.py", line 177, in main
maybe_install_llvm(install_dir=install_dir)
File "install_distro.py", line 138, in maybe_install_llvm
install_llvm(install_dir)
File "install_distro.py", line 113, in install_llvm
run(['wget', '--progress=dot:giga', target_url, '-O', filename])
File "install_distro.py", line 51, in run
p = subprocess.Popen(cmdlist, cwd=current_dir, stdout=f_out, stderr=subprocess.STDOUT, bufsize=1)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 710, in init
errread, errwrite)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1335, in _execute_child
raise child_exception
OSError: [Errno 2] No such file or directory
