
hughperkins / coriander


Build NVIDIA® CUDA™ code for OpenCL™ 1.2 devices

License: Apache License 2.0

Shell 1.08% C++ 33.51% Python 7.05% LLVM 46.95% C 2.42% Cuda 7.08% CMake 1.85% Dockerfile 0.05%
opencl gpu coriander mac radeon intel amd nvidia ubuntu

coriander's Introduction

Coriander

Build applications written in NVIDIA® CUDA™ code for OpenCL™ 1.2 devices.

Concept

  • leave applications in NVIDIA® CUDA™
  • compile into OpenCL 1.2
  • run on any OpenCL 1.2 GPU

How to use

  • Write an NVIDIA® CUDA™ sourcecode file, or find an existing one
  • Let's use cuda_sample.cu
  • Compile, using cocl:
$ cocl_py cuda_sample.cu
   ...
   ... (bunch of compily stuff) ...
   ...

    ./cuda_sample.cu compiled into ./cuda_sample

Run:

$ ./cuda_sample
Using Intel , OpenCL platform: Intel Gen OCL Driver
Using OpenCL device: Intel(R) HD Graphics 5500 BroadWell U-Processor GT2
hostFloats[2] 123
hostFloats[2] 222
hostFloats[2] 444
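
For reference, a minimal sample along the lines of the sketch below (illustrative only, not necessarily the exact cuda_sample.cu shipped in the repository) is enough to exercise the toolchain. It uses a setValue kernel plus a cudaMalloc/cudaMemcpy round-trip, matching the output shown above:

#include <cstdio>

__global__ void setValue(float *data, int idx, float value) {
    if (threadIdx.x == 0) {
        data[idx] = value;
    }
}

int main() {
    const int N = 4;
    float hostFloats[N];
    float *gpuFloats;
    cudaMalloc((void **)&gpuFloats, N * sizeof(float));

    setValue<<<dim3(32, 1, 1), dim3(32, 1, 1)>>>(gpuFloats, 2, 123.0f);
    cudaMemcpy(hostFloats, gpuFloats, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("hostFloats[2] %.0f\n", hostFloats[2]);

    cudaFree(gpuFloats);
    return 0;
}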

Options

What Coriander provides

  • compiler for host-side code, including memory allocation, copy, streams, kernel launches
  • compiler for device-side code, handling templated C++ code, converting it into bog-standard OpenCL 1.2 code
  • cuBLAS API implementations for GEMM, GEMV, SCAL, SAXPY (using Cedric Nugteren's CLBlast)
  • cuDNN API implementations for: convolutions (using the im2col algorithm over Cedric Nugteren's CLBlast), pooling, ReLU, tanh, and sigmoid
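
As an illustration of that host-side surface, a routine like the following hedged sketch (plain CUDA runtime calls, nothing Coriander-specific) is the kind of code the host-side compiler handles: allocation, copies, a stream, and a kernel launch:

__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

void scale_on_gpu(float *host, int n) {
    float *dev;
    cudaStream_t stream;
    cudaStreamCreate(&stream);                      // streams
    cudaMalloc((void **)&dev, n * sizeof(float));   // memory allocation
    cudaMemcpyAsync(dev, host, n * sizeof(float), cudaMemcpyHostToDevice, stream);  // copy
    scale<<<(n + 255) / 256, 256, 0, stream>>>(dev, n, 2.0f);                       // kernel launch
    cudaMemcpyAsync(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFree(dev);
}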

How Coriander works

Kernel compilation proceeds in two steps:

  • the device-side NVIDIA® CUDA™ code is compiled into LLVM IR, using clang
  • the LLVM IR is then rewritten into OpenCL 1.2 sourcecode

Slides are on the IWOCL website, here

Installation

Coriander development is carried out using the following platforms:

  • Ubuntu 16.04, with:
    • NVIDIA K80 GPU and/or NVIDIA K520 GPU (via aws)
  • Mac Book Pro 4th generation (thank you ASAPP :-) ), with:
    • Intel HD Graphics 530
    • Radeon Pro 450
    • Sierra OS

Other systems should work too, ideally. At a minimum, you will need at least one OpenCL-enabled GPU, with appropriate OpenCL drivers installed for that GPU. Both Linux and Mac systems stand a reasonable chance of working.

For installation, please see installation

Plugins

You can install the following plugins:

  • Coriander-clblast: just do cocl_plugins.py install --repo-url https://github.com/hughperkins/coriander-clblast
  • Coriander-dnn: just do cocl_plugins.py install --repo-url https://github.com/hughperkins/coriander-dnn
  • Your plugin here?

How to create a plugin

Add to your own cmake project

  • use cocl_add_executable and cocl_add_library
  • see cmake usage

Testing

See testing

Assumptions/relaxations made by Coriander

See assumptions

Roadmap

Libraries

Coriander uses the following libraries:

  • clang/llvm: c/c++ parser/compiler; many contributors
  • thrust: parallel GPU library, from NVIDIA®
  • yaml-cpp: yaml for c++, by Jesse Beder
  • EasyCL: wrapper for OpenCL 1.2 boilerplate
  • argparsecpp: command-line parser for c++
  • gtest: unit tests for c++, from Google

Related projects

How to Cite

Please cite: CUDA-on-CL: a compiler and runtime for running NVIDIA® CUDA™ C++11 applications on OpenCL™ 1.2 Devices

License

Apache 2.0

News

  • June 23:
    • factorized the CLBlast implementation of the NVIDIA® CUDA™ cuBLAS API into a new plugin, coriander-clblast
  • June 21:
    • created a new release v6.0.0, that marks a bunch of changes:
      • incorporates of course the earlier changes:
        • took some big steps towards portability and Windows compilation, i.e. using Python 2.7 scripts rather than bash scripts, and fixing many Windows-related compilation issues
        • the plugin architecture
        • factorizing the partial NVIDIA® CUDA™ cuDNN API implementation into a new plugin coriander-dnn
      • moved the default installation directory from /usr/local to ~/coriander
        • this means that plugins can be installed without sudo
        • it also makes it relatively easy to wipe and reinstall, for more effective jenkins testing
      • install_distro.py is now considerably more tested than a few days ago, and handles downloading llvm-4.0 automatically
  • Older news

coriander's People

Contributors

dokipen3d, emogenet, guoyejun, hughperkins, iame6162013, indianajohn, kolanich, weimzh


coriander's Issues

runtime_error

Got this error running on an AMD Radeon Pro 555:

_Z5rsqrtd is called, but not defined
This is probalby a bug in Coriander. Please file an issue at https://github.com/hughperkins/coriander/issues/new
basicblockdumper.runGeneration got exception whilst processing:
  %97 = tail call double @_Z5rsqrtd(double %96) #5

generateOpenCL failed to generate opencl sourcecode
kernel name orig=_ZN5Eigen8internal15EigenMetaKernelINS_15TensorEvaluatorIKNS_14TensorAssignOpINS_9TensorMapINS_6TensorIfLi1ELi1EiEELi16ENS_11MakePointerEEEKNS_19TensorCwiseBinaryOpINS0_20scalar_difference_opIffEEKS8_KNS9_INS0_17scalar_product_opIffEEKNS9_INSD_IKfSF_EEKNS_20TensorBroadcastingOpIKNS_5arrayIiLm1EEEKNS_17TensorReshapingOpIKNS_5SizesIJLl1EEEEKNS4_INS_15TensorFixedSizeISF_NSM_IJEEELi1EiEELi16ES7_EEEEEEKNS4_INS5_ISF_Li1ELi1EiEELi16ES7_EEEEKNS_18TensorCwiseUnaryOpINS0_15scalar_rsqrt_opIfEESC_EEEEEEEENS_9GpuDeviceEEEiEEvT_T0_
kernel name short=_ZN5Eigen8internal15
kernel name unique=_ZN5Eigen8internal15EigenMetaKernelINS_15TensorEvaluatorIKNS_14TensorAssignOpINS_9TensorMapINS_6TensorIfLi1ELi1EiEELi16ENS_11MakePointerEEEKNS_19TensorCwiseBinaryOpINS0_20scalar_difference_opIffEEKS8_KNS9_INS0_17scalar_product_opIffEEKNS9_INSD_IKfSF_EEKNS_20TensorBroadcastingOpIKNS_5arrayIiLm1EEEKNS_17TensorReshapingOpIKNS_5SizesIJLl1EEEEKNS4_INS_15TensorFixedSizeISF_NSM_IJEEELi1EiEELi16ES7_EEEEEEKNS4_INS5_ISF_Li1ELi1EiEELi16ES7_EEEEKNS_18TensorCwiseUnaryOpINS0_15scalar_rsqrt_opIfEESC_EEEEEEEENS_9GpuDeviceEEEiEEvT_T0__1_2_2_2_2_2
writing ll to /tmp/failed-kernel.ll
caught runtime error _Z5rsqrtd is called, but not defined => cannot continue.  Sorry :-(
libc++abi.dylib: terminating with uncaught exception of type std::runtime_error
Abort trap: 6
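
A minimal reproducer is presumably along these lines (a hedged sketch; the actual failing code is Eigen's rsqrt functor), since rsqrt taking a double mangles to _Z5rsqrtd:

__global__ void rsqrt_kernel(double *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // rsqrt on a double should trip the same "_Z5rsqrtd is called, but not defined" error
        data[i] = rsqrt(data[i]);
    }
}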

On Mac/Radeon, passing multiple buffers from same cudaMalloc into kernel launch causes -4 out of memory

On Mac/Radeon, passing multiple buffers from the same cudaMalloc into a kernel launch causes a -4 out-of-memory error.

I'm not sure to what extent passing multiple duplicate buffers into an OpenCL kernel launch is or isn't legal, but on Mac/Radeon drivers it looks like doing so isn't really supported.

The fix will be to de-duplicate buffers passed into the OpenCL kernels, which I'll probably look at doing soon-ish.
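
The pattern that triggers this is roughly the following (a hedged sketch): several kernel arguments carved out of a single cudaMalloc allocation, which all map onto the same underlying OpenCL buffer plus an offset:

__global__ void combine(float *a, float *b, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] + b[i];
}

void launch(int n) {
    float *base = 0;
    cudaMalloc((void **)&base, 3 * n * sizeof(float));
    float *a = base;            // all three arguments come from the same cudaMalloc,
    float *b = base + n;        // so the same cl_mem is passed to the kernel three
    float *out = base + 2 * n;  // times, which Mac/Radeon drivers appear to reject with -4
    combine<<<(n + 255) / 256, 256>>>(a, b, out, n);
    cudaFree(base);
}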

[arch] [beignet] CommandLine Error: Option 'help-list' registered more than once!

The following error comes up when running make run-tests.

I've not tested this on Ubuntu, and my RX 470 doesn't show this issue.

[       OK ] test_kernel_dumper.test_randomintarray (1 ms)
[----------] 6 tests from test_kernel_dumper (2 ms total)

[----------] 1 test from test_hostside_opencl_funcs
[ RUN      ] test_hostside_opencl_funcs.test_create_cl_kernel
: CommandLine Error: Option 'help-list' registered more than once!
LLVM ERROR: inconsistency in registered CommandLine options
make[3]: *** [test/gtest/CMakeFiles/run-gtest-tests.dir/build.make:61: test/gtest/CMakeFiles/run-gtest-tests] Error 1
make[2]: *** [CMakeFiles/Makefile2:423: test/gtest/CMakeFiles/run-gtest-tests.dir/all] Error 2
make[1]: *** [CMakeFiles/Makefile2:77: CMakeFiles/run-tests.dir/rule] Error 2
make: *** [Makefile:164: run-tests] Error 2

CMake error [AMD A10]

CMake Error at CMakeLists.txt:205 (add_library):
  Cannot find source file:

    src/CLBlast/src/database/database.cpp

  Tried extensions .c .C .c++ .cc .cpp .cxx .m .M .mm .h .hh .h++ .hm .hpp
  .hxx .in .txx

CMake Error: CMake can not determine linker language for target: clblast
CMake Error: Cannot determine link language for target "clblast".
CMake Error: CMake can not determine linker language for target: clblast

It would also be nice if the third-party dependencies were:
a) either downloaded when cloning the repo, or
b) declared via find_package(xxx), so that users know that these additional packages are needed.

(edited by Hugh for formatting)

Enable dynamic memory allocation inside GPU kernels

Hi Hugh,

I ran into some issues when attempting to test a CUDA project that was successfully compiled with Coriander, and I encountered this error message:

_Znwm is called, but not defined
basicblockdumper.runGeneration got exception whilst processing:  %6 = tail call i8* @_Znwm(i64 18112) #11

After some digging, I realized that this was related to the new operator and removed it. I then ran into this error:

_ZdaPv is called, but not defined
basicblockdumper.runGeneration got exception whilst processing:  tail call void @_ZdaPv(i8 addrspace(1)* %20) #11

I then realized that it was related to the delete operator. Is there any existing support for these operators, or are there any plans to support them?
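
_Znwm and _ZdaPv are the Itanium-mangled names of operator new and operator delete[], so the failing device code presumably allocates on the device heap, roughly along the lines of the hedged sketch below. OpenCL 1.2 has no in-kernel heap, so as far as I know the usual workaround is to allocate the scratch buffer with cudaMalloc on the host and pass it into the kernel:

// triggers the error: device-side heap allocation has no OpenCL 1.2 equivalent
__global__ void with_heap(float *out, int n) {
    float *scratch = new float[n];
    for (int i = 0; i < n; ++i) scratch[i] = (float)i;
    if (threadIdx.x < n) out[threadIdx.x] = scratch[threadIdx.x];
    delete[] scratch;
}

// workaround sketch: scratch is cudaMalloc'd by the host and passed in
__global__ void with_preallocated(float *out, float *scratch, int n) {
    for (int i = 0; i < n; ++i) scratch[i] = (float)i;
    if (threadIdx.x < n) out[threadIdx.x] = scratch[threadIdx.x];
}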

cuSolver support

AFAIK there isn't any support in Coriander for cuSolver; are there any plans to support it?

I assume this would be done through a plugin like Coriander-dnn; is this a difficult process?

py.test output

Ended with segfault
http://pastebin.com/AFayXygn

Also, there are 3 .cl source files still in /tmp/, but I'm not sure whether they are related to the failed test or not.
/tmp/test-device.cl
/tmp/testprog-device.cl
/tmp/test_expression_inlining-device.cl

can't use math function "floor" in a kernel

When using the function "floor" from math.h in a kernel, Coriander compiles and links the CUDA code fine, but at run time the linked binary produces the following error:

OpenCL platform: Intel Gen OCL Driver
OpenCL device: Intel(R) HD Graphics Kabylake ULT GT2
llvm.floor.f32 is called, but not defined
This is probalby a bug in Coriander. Please file an issue at https://github.com/hughperkins/coriander/issues/new
basicblockdumper.runGeneration got exception whilst processing:
%16 = tail call float @llvm.floor.f32(float %15) #6

generateOpenCL failed to generate opencl sourcecode
kernel name orig=_Z6kernelPf
kernel name short=_Z6kernelPf
kernel name unique=_Z6kernelPf_1
writing ll to /tmp/failed-kernel.ll
caught runtime error llvm.floor.f32 is called, but not defined => cannot continue. Sorry :-(
terminate called after throwing an instance of 'std::runtime_error'
what(): llvm.floor.f32 is called, but not defined => cannot continue. Sorry :-(
Aborted (core dumped)

Full CUDA code is in attachment.

testCode.zip
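
The mangled name in the log, _Z6kernelPf, corresponds to kernel(float*), so a minimal reproducer is presumably along these lines (a hedged sketch, not the attached testCode.zip):

#include <math.h>

__global__ void kernel(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // floor on a float lowers to the llvm.floor.f32 intrinsic, which Coriander
    // does not appear to map onto the OpenCL floor() builtin yet
    data[i] = floor(data[i]);
}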

patch_hostside aborts on windows

Greetings,

I know the Windows platform isn't entirely supported, but I figured I'd see if you could point me in the right direction for fixing my issue. I was able to compile Coriander successfully (with one code tweak to fix a compilation error) and now I'm trying to get cocl_py cuda_sample.cu to run.

I've had to do a few things to get it to run:

  1. I had to pass in the cocl-include flag due to some weird issues with the cocl-include prefix, but I think that's because I built from master instead of win-build, because I thought win-build had been merged in.

  2. cocl_py passes -std=c++11 flags to clang; this causes issues when compiling, since the C++ sources from Visual Studio are C++14 sources, meaning some of the functions in the algorithm headers are now constexpr. I was able to fix these issues by removing the flags.

With these few changes, it starts running, and makes it most of the way through. However, when patch_hostside runs, it exits with a Microsoft Visual C++ Runtime Library Debug error saying abort() had been called. The related output in the log is below:

/coriander/bin\patch_hostside --hostrawfile cuda_sample-hostraw.ll --devicellfile cuda_sample-device.ll --hostpatchedfile cuda_sample-hostpatched.ll
While deleting: void (float*, i32, float)* %�?setValue@@YAXPEAMHM@Z
Use still stuck around after Def is destroyed:  <badref> = bitcast void (float*, i32, float)* @"\01?setValue@@YAXPEAMHM@Z" to i8*
Assertion failed: use_empty() && "Uses remain when a value is destroyed!", file E:\Users\welna\Downloads\llvm-4.0.0.src.tar\llvm-4.0.0.src\lib\IR\Value.cpp, line 85
Traceback (most recent call last):
  File "/coriander/bin/cocl.py", line 384, in <module>
    '--hostpatchedfile', '%s-hostpatched.ll' % OUTPUTBASEPATH
  File "/coriander/bin/cocl.py", line 281, in run
    print(check_output(cmdline_list))
  File "/coriander/bin/cocl.py", line 48, in check_output
    res = subprocess.check_output(cmd_list)
  File "E:\Users\welna\AppData\Local\Programs\Python\Python36\lib\subprocess.py", line 336, in check_output
    **kwargs).stdout
  File "E:\Users\welna\AppData\Local\Programs\Python\Python36\lib\subprocess.py", line 418, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/coriander/bin\\patch_hostside', '--hostrawfile', 'cuda_sample-hostraw.ll', '--devicellfile', 'cuda_sample-device.ll', '--hostpatchedfile', 'cuda_sample-hostpatched.ll']' returned non-zero exit status 3.

Any ideas what the issue could be? If you can point it out, I could probably fiddle around with it and fix it myself, but I have no idea where to begin after googling similar issues since I'm not familiar with the patch_hostside code base.

ModuleNotFoundError: No module named 'cocl_env'

Running
python cocl_plugins.py install --repo-url https://github.com/hughperkins/coriander-dnn

Results in

Traceback (most recent call last):
  File "cocl_plugins.py", line 11, in <module>
    from cocl_env import CLANG_HOME, COCL_LIB, COCL_INCLUDE, COCL_INSTALL_PREFIX
ModuleNotFoundError: No module named 'cocl_env'

Occurs on both Ubuntu and Arch.

Unittests test_clone and test_create_cl_kernel fail

Hello,
I ran the unit tests and 2 of them fail:

[==========] 79 tests from 13 test cases ran. (570 ms total)
[  PASSED  ] 77 tests.
[  FAILED  ] 2 tests, listed below:
[  FAILED  ] test_struct_cloner.test_clone
[  FAILED  ] test_hostside_opencl_funcs.test_create_cl_kernel

test_struct_cloner.test_clone

[ RUN      ] test_struct_cloner.test_clone
structDefinitions struct mystruct {
    int f0;
    float f1;
    global float* f2;
    int f3;
    global float* f4;
};

structDefinition foo {
    int f0;
    float f1;
    global float* f2;
    int f3;
    global float* f4;
};

structNoPtrCl foo {
    int f0;
    float f1;
    int f2;
};

clCopyCode dest.f0 = src.f0;
dest.f1 = src.f1;
dest.f2 = 0;
dest.f3 = src.f2;
dest.f4 = 0;

testIR [; ModuleID = 'hostsideM'
source_filename = "hostsideM"

%"struct mystruct" = type { i32, float, float*, i32, float* }
%"struct mystruct_nopointers" = type { i32, float, i32 }

define void @testfunc() {
entry:
  %0 = alloca %"struct mystruct"
  %1 = alloca %"struct mystruct_nopointers"
  %2 = getelementptr inbounds %"struct mystruct", %"struct mystruct"* %0, i32 0, i32 0
  %3 = getelementptr inbounds %"struct mystruct_nopointers", %"struct mystruct_nopointers"* %1, i32 0, i32 0
  %loadint = load i32, i32* %2
  store volatile i32 %loadint, i32* %3
  %4 = getelementptr inbounds %"struct mystruct", %"struct mystruct"* %0, i32 0, i32 1
  %5 = getelementptr inbounds %"struct mystruct_nopointers", %"struct mystruct_nopointers"* %1, i32 0, i32 1
  %loadint1 = load float, float* %4
  store volatile float %loadint1, float* %5
  %6 = getelementptr inbounds %"struct mystruct", %"struct mystruct"* %0, i32 0, i32 3
  %7 = getelementptr inbounds %"struct mystruct_nopointers", %"struct mystruct_nopointers"* %1, i32 0, i32 2
  %loadint2 = load i32, i32* %6
  store volatile i32 %loadint2, i32* %7
  ret void
}
]
/home/thomas/src/cuda-on-cl/test/gtest/test_struct_cloner.cpp:159: Failure
Value of: testIR
  Actual: "; ModuleID = 'hostsideM'\nsource_filename = \"hostsideM\"\n\n%\"struct mystruct\" = type { i32, float, float*, i32, float* }\n%\"struct mystruct_nopointers\" = type { i32, float, i32 }\n\ndefine void @testfunc() {\nentry:\n  %0 = alloca %\"struct mystruct\"\n  %1 = alloca %\"struct mystruct_nopointers\"\n  %2 = getelementptr inbounds %\"struct mystruct\", %\"struct mystruct\"* %0, i32 0, i32 0\n  %3 = getelementptr inbounds %\"struct mystruct_nopointers\", %\"struct mystruct_nopointers\"* %1, i32 0, i32 0\n  %loadint = load i32, i32* %2\n  store volatile i32 %loadint, i32* %3\n  %4 = getelementptr inbounds %\"struct mystruct\", %\"struct mystruct\"* %0, i32 0, i32 1\n  %5 = getelementptr inbounds %\"struct mystruct_nopointers\", %\"struct mystruct_nopointers\"* %1, i32 0, i32 1\n  %loadint1 = load float, float* %4\n  store volatile float %loadint1, float* %5\n  %6 = getelementptr inbounds %\"struct mystruct\", %\"struct mystruct\"* %0, i32 0, i32 3\n  %7 = getelementptr inbounds %\"struct mystruct_nopointers\", %\"struct mystruct_nopointers\"* %1, i32 0, i32 2\n  %loadint2 = load i32, i32* %6\n  store volatile i32 %loadint2, i32* %7\n  ret void\n}\n"
Expected: expectedIR
Which is: "; ModuleID = 'hostsideM'\n\n%\"struct mystruct\" = type { i32, float, float*, i32, float* }\n%\"struct mystruct_nopointers\" = type { i32, float, i32 }\n\ndefine void @testfunc() {\nentry:\n  %0 = alloca %\"struct mystruct\"\n  %1 = alloca %\"struct mystruct_nopointers\"\n  %2 = getelementptr inbounds %\"struct mystruct\", %\"struct mystruct\"* %0, i32 0, i32 0\n  %3 = getelementptr inbounds %\"struct mystruct_nopointers\", %\"struct mystruct_nopointers\"* %1, i32 0, i32 0\n  %loadint = load i32, i32* %2\n  store volatile i32 %loadint, i32* %3\n  %4 = getelementptr inbounds %\"struct mystruct\", %\"struct mystruct\"* %0, i32 0, i32 1\n  %5 = getelementptr inbounds %\"struct mystruct_nopointers\", %\"struct mystruct_nopointers\"* %1, i32 0, i32 1\n  %loadint1 = load float, float* %4\n  store volatile float %loadint1, float* %5\n  %6 = getelementptr inbounds %\"struct mystruct\", %\"struct mystruct\"* %0, i32 0, i32 3\n  %7 = getelementptr inbounds %\"struct mystruct_nopointers\", %\"struct mystruct_nopointers\"* %1, i32 0, i32 2\n  %loadint2 = load i32, i32* %6\n  store volatile i32 %loadint2, i32* %7\n  ret void\n}\n"
[  FAILED  ] test_struct_cloner.test_clone (1 ms)

test_hostside_opencl_funcs.test_create_cl_kernel

[ RUN      ] test_hostside_opencl_funcs.test_create_cl_kernel
Using Advanced Micro Devices, Inc. , OpenCL platform: AMD Accelerated Parallel Processing
Using OpenCL device: Ellesmere
building kernel myKernel
 ... built
hostdata[0] 0
/home/thomas/src/cuda-on-cl/test/gtest/test_hostside_opencl_funcs.cpp:55: Failure
Value of: hostdata[0]
  Actual: 0
Expected: 123.0f
Which is: 123
[  FAILED  ] test_hostside_opencl_funcs.test_create_cl_kernel (85 ms)

System info:

kernel: 4.10.10-gentoo
opencl: amdgpu-pro-opencl-16.60.1.379184 on top of amdgpu open source stack
llvm: 3.9.1
hardware: AMD RX480

clinfo output:

Platform #0
  Name:                                  AMD Accelerated Parallel Processing
  Version:                               OpenCL 2.0 AMD-APP (2264.10)

  Device #0
    Name:                                Ellesmere
    Type:                                GPU
    Version:                             OpenCL 1.2 AMD-APP (2264.10)
    Global memory size:                  7 GB 149 MB 396 kB 
    Local memory size:                   32 kB 
    Max work group size:                 256
    Max work group size:                 (256, 256, 256)

  Device #1
    Name:                                Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
    Type:                                CPU
    Version:                             OpenCL 1.2 AMD-APP (2264.10)
    Global memory size:                  7 GB 703 MB 352 kB 
    Local memory size:                   32 kB 
    Max work group size:                 1024
    Max work item sizes:                 (1024, 1024, 1024)

How would you read these results?

Thanks,
Thomas.

Compatibility with TensorFlow XLA/NVIDIA TensorRT?

Hi

This is a general question about deep learning inference acceleration with Coriander. TF XLA is a good idea for inference optimization, but its availability is limited to CUDA. NVIDIA has also released TensorRT as an inference optimizer.
Has anyone tried using Coriander with TF XLA or TensorRT?

porting on aarch64

My board info:
arm64 Debian (jessie), GCC 6.2, LLVM 3.8.

  1. This step succeeds:
    git clone --recursive https://github.com/hughperkins/cuda-on-cl
    cd cuda-on-cl
    make
    sudo make install
  2. This step fails: make run-test-cocl-cuda_sample

g++ -o build/test-cocl-cuda_sample build/test-cocl-cuda_sample.o -g -lcocl -lOpenCL
/usr/bin/ld: build/test-cocl-cuda_sample.o: relocation R_AARCH64_ADR_PREL_PG_HI21 against external C
/usr/bin/ld: build/test-cocl-cuda_sample.o(.text+0xe8): unresolvable R_AARCH64_ADR_PREL_PG_HI21 rel'
/usr/bin/ld: final link failed: Bad value
collect2: error: ld returned 1 exit status
Makefile:128: recipe for target 'build/test-cocl-cuda_sample' failed
make: *** [build/test-cocl-cuda_sample] Error 1

sgemm status code -2048 for test_dnn_conv.simple_gpu_conv_backward_filters [Intel HD530, Mac Sierra]

[ RUN      ] test_dnn_conv.simple_gpu_conv_backward_filters
outH=5 outW=6
forward workspaceSizeBytes=3240
backward filter workspaceSizeBytes=5400
workspaceSizeBytes=5400
CL_GPUOFFSET var detected, changing gpu offset from 0 to 0
Using Apple , OpenCL platform: Apple
Using OpenCL device: Intel(R) HD Graphics 530
gpuMemoryAllocSize=10340
building kernel im2col_kernel
 ... built
sgemm status code -2048
unknown file: Failure
Unknown C++ exception thrown in the test body.

Error code means:

  kKernelLaunchError         = -2048, // Problem occurred when enqueuing the kernel

More Secure Route to Install LLVM 4.0

I was troubled that LLVM:

  • Has no HTTPS on the release downloads
  • Offers no SSL-protected way to access secure hashes or GPG keys for verification

For Ubuntu 16.04 I found that the Rust PPA provides LLVM / Clang 4.0, so this might make a useful alternative for installing the toolchains in a secure way:

sudo add-apt-repository ppa:ubuntu-toolchain-r/rust
sudo apt-get update
sudo apt-get install llvm-4.0 llvm-4.0-dev clang-4.0

where can I see the openCL file?

After compiling with cocl, e.g. cocl_py cuda_sample.cu, I only get a binary executable file "cuda_sample".
Is it possible to see the generated OpenCL source code (.cl file) somewhere?

Code won't compile on High Sierra

Hi!
I've been having issues with Coriander on macOS 10.13. I get this error whenever I try to compile the sample included with this repo:

cocl args: cs.cu
LLVM_COMPILE_FLAGS -I/usr/local/opt/clang+llvm-4.0.0-x86_64-apple-darwin/include -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/usr/local/opt/clang+llvm-4.0.0-x86_64-apple-darwin/include -fPIC -fvisibility-inlines-hidden -Wall -W -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wcovered-switch-default -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wstring-conversion -Werror=date-time -std=c++11 -fexceptions -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS
+ /usr/local/opt/llvm-4.0/bin/clang++ -DUSE_CLEW -std=c++11 -x cuda -D__CORIANDERCC__ -D__CUDACC__ --cuda-gpu-arch=sm_30 -nocudalib -nocudainc --cuda-device-only -emit-llvm -O2 -S -stdlib=libc++ -Wno-gnu-anonymous-struct -Wno-nested-anon-types -I/usr/local/opt/clang+llvm-4.0.0-x86_64-apple-darwin/include -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/usr/local/opt/clang+llvm-4.0.0-x86_64-apple-darwin/include -fPIC -fvisibility-inlines-hidden -Wall -W -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wcovered-switch-default -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wstring-conversion -Werror=date-time -std=c++11 -fexceptions -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/usr/local/include/EasyCL -I/usr/local/include/cocl -include /usr/local/include/cocl/cocl.h -include /usr/local/include/cocl/fake_funcs.h -include /usr/local/include/cocl/cocl_deviceside.h -I/usr/local/include ./cs.cu -o ./cs-device-noopt.ll
In file included from <built-in>:1:
In file included from /usr/local/include/cocl/cocl.h:10:
In file included from /usr/local/opt/clang+llvm-4.0.0-x86_64-apple-darwin/bin/../include/c++/v1/stdexcept:46:
In file included from /usr/local/opt/clang+llvm-4.0.0-x86_64-apple-darwin/bin/../include/c++/v1/exception:82:
In file included from /usr/local/opt/clang+llvm-4.0.0-x86_64-apple-darwin/bin/../include/c++/v1/cstdlib:86:
/usr/local/opt/clang+llvm-4.0.0-x86_64-apple-darwin/bin/../include/c++/v1/stdlib.h:94:15: fatal error: 'stdlib.h' file not found
#include_next <stdlib.h>
              ^~~~~~~~~~
1 error generated.

Any help would be greatly appreciated.
P.S. Running this on a new 15" MacBook Pro, in case it helps :)

Diagnose and/or fix the eigen test failures

Coriander can now be run from within the Eigen test framework, on branch eigen-cl, https://bitbucket.org/eigen/eigen/pull-requests/240/opencl-12-on-mac/diff . The first group of tests runs successfully :-), i.e. you can do:

hg clone https://[email protected]/hughperkins/eigen
cd eigen
mkdir build
cd build
ccmake ..
# press 'c'
# set EIGEN_TEST_CORIANDER to ON
# press 'c' a couple of times, then 'g'
make cxx11_tensor_cuda_1
unsupported/test/cxx11_tensor_cuda_1

However, it would be nice if some more of the tests passed :-) . So there is an opportunity for one or more people to look into why the other tests are failing, and either raise an issue with the root cause(s), or fix said root causes :-)

[arch] testshfl.cu:50 Assertion `hostFloats1[0] == 1001' failed

make run-testshfl
OpenCL platform: AMD Accelerated Parallel Processing
OpenCL device: Ellesmere
1002
1003
1004
1006
1007
testshfl: /coriander/test/endtoend/testshfl.cu:50: int main(int, char **): Assertion `hostFloats1[0] == 1001' failed.
make[3]: *** [test/endtoend/CMakeFiles/run-testshfl.dir/build.make:61: test/endtoend/CMakeFiles/run-testshfl] Aborted (core dumped)
make[2]: *** [CMakeFiles/Makefile2:2785: test/endtoend/CMakeFiles/run-testshfl.dir/all] Error 2
make[1]: *** [CMakeFiles/Makefile2:627: test/endtoend/CMakeFiles/run-endtoend-tests.dir/rule] Error 2
make: *** [Makefile:307: run-endtoend-tests] Error 2

Just a note for myself

test failed on master 4f051a4 on iMac

[ RUN ] test_hostside_opencl_funcs.test_create_cl_kernel
Using Apple , OpenCL platform: Apple
Using OpenCL device: Iris Pro
building kernel myKernel
... built
hostdata[0] 0
/Volumes/Data/work/DeepLearning/tensorflow/tensorflow-cl/third_party/cuda-on-cl/test/gtest/test_hostside_opencl_funcs.cpp:55: Failure
Value of: hostdata[0]
Actual: 0
Expected: 123.0f
Which is: 123
[ FAILED ] test_hostside_opencl_funcs.test_create_cl_kernel (1087 ms)
[----------] 1 test from test_hostside_opencl_funcs (1087 ms total)

[----------] Global test environment tear-down
[==========] 57 tests from 9 test cases ran. (1139 ms total)
[ PASSED ] 56 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] test_hostside_opencl_funcs.test_create_cl_kernel

tf `random_op_gpu.cc` fails

tf random_op_gpu.cc fails.

Example output https://gist.github.com/hughperkins/a41080ce90bca976ba37d2e97cc868c4

There are a few issues:

  • detection of by-value structs is unreliable (fixed in eade62d)
  • parentheses not added around == expressions (fixed in 5e167c8 )
  • float constants that trigger scientific notation (eg 1e-7) are being displayed incorrectly (eg 1e-7.0f) . (fixed in 38dc48b )
  • (New!) _Z3logf not implemented, see below . (fixed in d000b5f)
  • (New!) readnone kernel parameters cause calling inconsistencies (fixed in eae83fd )
  • (New!) Need to implement umulhi, currently stubbed...
    } else if(functionName == "_Z8__umulhiii") {
    localValueInfo->setAddressSpace(0);
    localValueInfo->setExpression("0");
    return;
    . (Implemented in 1d9e0e8 )
  • (New!) Missing _Z6memcpyPvPKvm implementation, on device-side . (implemented in c24f828 )

To reproduce the issue:

Relevant bytecode:

Example generated OpenCL:

Example of _Z3logf not implemented issue. Output during running the test_misc.py script:

__internal__ build log: 
<program source>:440:10: error: assigning to 'float' from incompatible type 'void'
    v245 = _Z3logf((v224 < 1e-07f) ? 1e-07f : v224, scratch);
         ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<program source>:476:10: error: assigning to 'float' from incompatible type 'void'
    v277 = _Z3logf((v259 < 1e-07f) ? 1e-07f : v259, scratch);
         ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

_Z6memcpyPvPKvm: in the OpenCL, we can see:

_Z6memcpyPvPKvm(v65, v64, 4, scratch);

and

void _Z6memcpyPvPKvm(char* v1, char* v2, long v3, local int *scratch) {
}

... so the declaration of _Z6memcpyPvPKvm is being treated as an implementation, and then called. We should at least replace this implementation with something workable. It might be ideal if missing implementations raised an exception, perhaps.

../get-llvm-cxxflags.sh: line 9: /usr/local/opt/llvm-4.0/bin/llvm-config: No such file or directory

I'm reporting this from Arch Linux.

CLANG_HOME is used in the erroring scripts.
I think the errors are due to CLANG_HOME pointing to "/usr/local/opt/llvm-4.0", even though since v6.0.0 the default installation directory has moved from /usr/local to ~/coriander.

-- The C compiler identification is GNU 7.1.1
-- The CXX compiler identification is GNU 7.1.1
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
/coriander/cmake/get-llvm-cxxflags.sh: line 9: /usr/local/opt/llvm-4.0/bin/llvm-config: No such file or directory
/coriander/cmake/llvm-syslibs.sh: line 4: /usr/local/opt/llvm-4.0/bin/llvm-config: No such file or directory
/coriander/cmake/get-llvm-libs.sh: line 6: /usr/local/opt/llvm-4.0/bin/llvm-config: No such file or directory
-- Configuring done
-- Generating done
-- Build files have been written to: /coriander/build

CUDA code won't run

I tried compiling the following CUDA code with cocl cudatest.cu on a macOS Sierra machine with a GeForce GTX 1070:

#include <stdio.h>

#define SIZE 1000

__global__ void kernel_matrix_add(float *input1, float *input2, float *output) {
        const unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
        output[idx] = input1[idx] + input2[idx];
}

__global__ void kernel_matrix_multiply(float *input1, float *input2, float *output) {
        const unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
        output[idx] = input1[idx] * input2[idx];
}

__global__ void kernel_matrix_divide(float *input1, float *input2, float *output) {
        const unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
        output[idx] = input1[idx] / input2[idx];
}

__global__ void kernel_matrix_subtract(float *input1, float *input2, float *output) {
        const unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
        output[idx] = input1[idx] - input2[idx];
}


int main() {

        float * in1;
        float * in2;
        float * out;

        cudaHostAlloc(&in1, SIZE*sizeof(float), cudaHostAllocDefault);
        cudaHostAlloc(&in2, SIZE*sizeof(float), cudaHostAllocDefault);
        cudaHostAlloc(&out, SIZE*sizeof(float), cudaHostAllocDefault);

        for (int i = 0; i < SIZE; ++i) {
                in1[i] = i;
                in2[i] = i;
                out[i] = 0;
        }

        float * d_in1;
        float * d_in2;
        float * d_out;

        cudaMalloc(&d_in1, SIZE*sizeof(float));
        cudaMalloc(&d_in2, SIZE*sizeof(float));
        cudaMalloc(&d_out, SIZE*sizeof(float));

        cudaMemcpy(d_in1, in1, SIZE*sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_in2, in2, SIZE*sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_out, out, SIZE*sizeof(float), cudaMemcpyHostToDevice);

        kernel_matrix_multiply<<<SIZE / 1024 + 1, 1024>>>(d_in1, d_in2, d_out);

        cudaMemcpy(out, d_out, SIZE*sizeof(float), cudaMemcpyDeviceToHost);

        printf("First 10 Results:\n");

        for (int i = 0; i < 10; ++i) {
                printf("%f\n", out[i]);
        }

}

But I got the following error:

cocl args: cudatest.cu
LLVM_COMPILE_FLAGS -I/usr/local/opt/clang+llvm-4.0.0-x86_64-apple-darwin/include -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/usr/local/opt/clang+llvm-4.0.0-x86_64-apple-darwin/include -fPIC -fvisibility-inlines-hidden -Wall -W -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wcovered-switch-default -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wstring-conversion -Werror=date-time -std=c++11 -fexceptions -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS
+ /usr/local/opt/llvm-4.0/bin/clang++ -DUSE_CLEW -std=c++11 -x cuda -D__CORIANDERCC__ -D__CUDACC__ --cuda-gpu-arch=sm_30 -nocudalib -nocudainc --cuda-device-only -emit-llvm -O2 -S -stdlib=libc++ -Wno-gnu-anonymous-struct -Wno-nested-anon-types -I/usr/local/opt/clang+llvm-4.0.0-x86_64-apple-darwin/include -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/usr/local/opt/clang+llvm-4.0.0-x86_64-apple-darwin/include -fPIC -fvisibility-inlines-hidden -Wall -W -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wcovered-switch-default -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wstring-conversion -Werror=date-time -std=c++11 -fexceptions -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/usr/local/include/EasyCL -I/usr/local/include/cocl -include /usr/local/include/cocl/cocl.h -include /usr/local/include/cocl/fake_funcs.h -include /usr/local/include/cocl/cocl_deviceside.h -I/usr/local/include ./cudatest.cu -o ./cudatest-device-noopt.ll
./cudatest.cu:41:42: error: use of undeclared identifier 'cudaHostAllocDefault'
        cudaHostAlloc(&in1, SIZE*sizeof(float), cudaHostAllocDefault);
                                                ^
./cudatest.cu:42:42: error: use of undeclared identifier 'cudaHostAllocDefault'
        cudaHostAlloc(&in2, SIZE*sizeof(float), cudaHostAllocDefault);
                                                ^
./cudatest.cu:43:42: error: use of undeclared identifier 'cudaHostAllocDefault'
        cudaHostAlloc(&out, SIZE*sizeof(float), cudaHostAllocDefault);
                                                ^
3 errors generated.

Is there a limitation to coriander I'm not aware of? How can I make this compatible?

Thanks!
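
cudaHostAllocDefault (and cudaHostAlloc itself, as far as I can tell) is simply not declared by Coriander's headers. A hedged workaround, assuming pinned host memory is not actually needed here, is to use plain host allocations and keep the device side unchanged, for example:

#include <cstdio>
#include <cstdlib>

#define SIZE 1000

__global__ void kernel_matrix_multiply(float *input1, float *input2, float *output) {
        const unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < SIZE) output[idx] = input1[idx] * input2[idx];
}

int main() {
        // plain malloc instead of cudaHostAlloc(..., cudaHostAllocDefault)
        float *in1 = (float *)malloc(SIZE * sizeof(float));
        float *in2 = (float *)malloc(SIZE * sizeof(float));
        float *out = (float *)malloc(SIZE * sizeof(float));
        for (int i = 0; i < SIZE; ++i) { in1[i] = i; in2[i] = i; out[i] = 0; }

        float *d_in1, *d_in2, *d_out;
        cudaMalloc(&d_in1, SIZE * sizeof(float));
        cudaMalloc(&d_in2, SIZE * sizeof(float));
        cudaMalloc(&d_out, SIZE * sizeof(float));
        cudaMemcpy(d_in1, in1, SIZE * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_in2, in2, SIZE * sizeof(float), cudaMemcpyHostToDevice);

        // 256 threads per block stays within most OpenCL devices' max work-group size
        kernel_matrix_multiply<<<(SIZE + 255) / 256, 256>>>(d_in1, d_in2, d_out);

        // copy the results back device-to-host
        cudaMemcpy(out, d_out, SIZE * sizeof(float), cudaMemcpyDeviceToHost);
        for (int i = 0; i < 10; ++i) printf("%f\n", out[i]);

        free(in1); free(in2); free(out);
        return 0;
}

Pinned-memory semantics are lost, but for this sample that should not matter.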

`global **` kernel parameters not supported

See hughperkins/tf-coriander#33. Currently, global ** parameters in by-value structs are not handled by Coriander. I think they should be.

Relevant bytecode: files prefixed split_lib_gpu in https://github.com/hughperkins/coriander/tree/895934eff7bdfe939cb62ea1f9ad7e3a06b3c8d5/test/tf

Example of current generated opencl: https://gist.github.com/hughperkins/becd78a579fd17d0fc7aeb277d5cee68

struct tensorflow__CudaDeviceArrayStruct {
    int f0;
    float* f1[8];
    global float** f2;
};

kernel void _ZN10tensorflow12_GL(global char* clmem0, global char* clmem1, uint v9_offset, int v10, int v11, int v12, uint v13_offset, local int *scratch);

kernel void _ZN10tensorflow12_GL(global char* clmem0, global char* clmem1, uint v9_offset, int v10, int v11, int v12, uint v13_offset, local int *scratch) {
    global struct tensorflow__CudaDeviceArrayStruct* v13 = (global struct tensorflow__CudaDeviceArrayStruct*)(clmem1 + v13_offset);
    global float* v9 = (global float*)(clmem0 + v9_offset);
...

As for how to handle this ... :-P . ???
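
For reference, the CUDA-side pattern is roughly a by-value struct carrying device pointers, along the lines of this hedged sketch (the real tensorflow CudaDeviceArrayStruct is more involved; the names here are illustrative):

struct DeviceArray {
    int count;                 // f0 in the generated struct
    float *inline_values[8];   // f1: small arrays stored inline in the struct
    float **out_of_line;       // f2: the global ** member Coriander cannot handle yet
};

__global__ void gather(DeviceArray arr, float *out) {
    int i = threadIdx.x;
    if (i < arr.count) {
        out[i] = *arr.out_of_line[i];
    }
}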

Add batch normalization

Would be very cool to have batch normalization enabled. There is a short window of opportunity for someone to implement this, whilst I'm working on fixing split, and conv.

cocl works, cocl_py doesn't

I am using Python 2.7.10 on macOS and find that cocl_py does not work in the way that cocl does. Is there something horribly wrong with my Python environment? I do not do anything to it, other than installing Python 2 and 3 from Homebrew, which should not be used here.

Thanks for your help. This is an awesome project.

Error

/opt/llvm/cocl/bin/cocl.py -g -O3 -std=c++11 -D_MWAITXINTRIN_H_INCLUDED -DPRKVERSION=2.16 nstream-cuda.cu -o nstream-cuda
Traceback (most recent call last):
  File "/opt/llvm/cocl/bin/cocl.py", line 114, in <module>
    DSTRIPPED = THISARG.REPLACE('-D-D', '-D')
AttributeError: 'str' object has no attribute 'REPLACE'

Working

/opt/llvm/cocl/bin/cocl -g -O3 -std=c++11 -D_MWAITXINTRIN_H_INCLUDED -DPRKVERSION="2.16" nstream-cuda.cu -o nstream-cuda

WARNING: `cocl` is deprecated, and will be removed in a future release
Please use: `cocl_py`, which is easier to maintain, and portable

cocl args: -g -O3 -std=c++11 -D_MWAITXINTRIN_H_INCLUDED -DPRKVERSION=2.16 nstream-cuda.cu -o nstream-cuda
LLVM_COMPILE_FLAGS -I/opt/llvm/clang+llvm-4.0.0-x86_64-apple-darwin/include -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/opt/llvm/clang+llvm-4.0.0-x86_64-apple-darwin/include -fPIC -fvisibility-inlines-hidden -Wall -W -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wcovered-switch-default -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wstring-conversion -Werror=date-time -std=c++11 -fexceptions -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS
+ /opt/llvm/clang+llvm-4.0.0-x86_64-apple-darwin/bin/clang++ -D_MWAITXINTRIN_H_INCLUDED -DPRKVERSION=2.16 -DUSE_CLEW -std=c++11 -x cuda -D__CORIANDERCC__ -D__CUDACC__ --cuda-gpu-arch=sm_30 -nocudalib -nocudainc --cuda-device-only -emit-llvm -O2 -S -stdlib=libc++ -Wno-gnu-anonymous-struct -Wno-nested-anon-types -I/opt/llvm/clang+llvm-4.0.0-x86_64-apple-darwin/include -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/opt/llvm/clang+llvm-4.0.0-x86_64-apple-darwin/include -fPIC -fvisibility-inlines-hidden -Wall -W -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wcovered-switch-default -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wstring-conversion -Werror=date-time -std=c++11 -fexceptions -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/opt/llvm/cocl/include/EasyCL -I/opt/llvm/cocl/include/cocl -include /opt/llvm/cocl/include/cocl/cocl.h -include /opt/llvm/cocl/include/cocl/fake_funcs.h -include /opt/llvm/cocl/include/cocl/cocl_deviceside.h -I/opt/llvm/cocl/include ./nstream-cuda.cu -o ./nstream-cuda-device-noopt.ll
+ /opt/llvm/clang+llvm-4.0.0-x86_64-apple-darwin/bin/opt -inline -mem2reg -instcombine -S -o ./nstream-cuda-device.ll ./nstream-cuda-device-noopt.ll
+ /opt/llvm/clang+llvm-4.0.0-x86_64-apple-darwin/bin/clang++ -D_MWAITXINTRIN_H_INCLUDED -DPRKVERSION=2.16 -DUSE_CLEW -std=c++11 -x cuda -nocudainc --cuda-host-only -emit-llvm -O2 -S -g -D__CUDACC__ -D__CORIANDERCC__ -Wno-gnu-anonymous-struct -Wno-nested-anon-types -I/opt/llvm/clang+llvm-4.0.0-x86_64-apple-darwin/include -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/opt/llvm/clang+llvm-4.0.0-x86_64-apple-darwin/include -fPIC -fvisibility-inlines-hidden -Wall -W -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wcovered-switch-default -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wstring-conversion -Werror=date-time -std=c++11 -fexceptions -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/opt/llvm/cocl/include -I/opt/llvm/cocl/include/EasyCL -I/opt/llvm/cocl/include/cocl -stdlib=libc++ -I/opt/llvm/clang+llvm-4.0.0-x86_64-apple-darwin/include -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/opt/llvm/clang+llvm-4.0.0-x86_64-apple-darwin/include -fPIC -fvisibility-inlines-hidden -Wall -W -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wcovered-switch-default -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wstring-conversion -Werror=date-time -std=c++11 -fexceptions -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -include /opt/llvm/cocl/include/cocl/cocl.h -include /opt/llvm/cocl/include/cocl/fake_funcs.h -include /opt/llvm/cocl/include/cocl/cocl_hostside.h ./nstream-cuda.cu -o ./nstream-cuda-hostraw.ll
+ /opt/llvm/cocl/bin/patch_hostside --hostrawfile ./nstream-cuda-hostraw.ll --devicellfile ./nstream-cuda-device.ll --hostpatchedfile ./nstream-cuda-hostpatched.ll
+ /opt/llvm/clang+llvm-4.0.0-x86_64-apple-darwin/bin/clang++ -D_MWAITXINTRIN_H_INCLUDED -DPRKVERSION=2.16 -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -fPIC -fvisibility-inlines-hidden -Wall -W -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wcovered-switch-default -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wstring-conversion -Werror=date-time -std=c++11 -fexceptions -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -DUSE_CLEW -O3 -g -c ./nstream-cuda-hostpatched.ll -o ./nstream-cuda.o
+ clang++ -Wl,-rpath,/opt/llvm/cocl/lib -Wl,-rpath,59357ORIGIN -g -o ./nstream-cuda ./nstream-cuda.o -L/opt/llvm/cocl/lib -lcocl -lclblast -leasycl -lclew -lpthread -L/opt/llvm/clang+llvm-4.0.0-x86_64-apple-darwin/lib -Wl,-search_paths_first -Wl,-headerpad_max_install_names -lLLVMLTO -lLLVMPasses -lLLVMObjCARCOpts -lLLVMMIRParser -lLLVMSymbolize -lLLVMDebugInfoPDB -lLLVMDebugInfoDWARF -lLLVMCoverage -lLLVMTableGen -lLLVMOrcJIT -lLLVMXCoreDisassembler -lLLVMXCoreCodeGen -lLLVMXCoreDesc -lLLVMXCoreInfo -lLLVMXCoreAsmPrinter -lLLVMSystemZDisassembler -lLLVMSystemZCodeGen -lLLVMSystemZAsmParser -lLLVMSystemZDesc -lLLVMSystemZInfo -lLLVMSystemZAsmPrinter -lLLVMSparcDisassembler -lLLVMSparcCodeGen -lLLVMSparcAsmParser -lLLVMSparcDesc -lLLVMSparcInfo -lLLVMSparcAsmPrinter -lLLVMRISCVDesc -lLLVMRISCVCodeGen -lLLVMRISCVInfo -lLLVMPowerPCDisassembler -lLLVMPowerPCCodeGen -lLLVMPowerPCAsmParser -lLLVMPowerPCDesc -lLLVMPowerPCInfo -lLLVMPowerPCAsmPrinter -lLLVMNVPTXCodeGen -lLLVMNVPTXDesc -lLLVMNVPTXInfo -lLLVMNVPTXAsmPrinter -lLLVMMSP430CodeGen -lLLVMMSP430Desc -lLLVMMSP430Info -lLLVMMSP430AsmPrinter -lLLVMMipsDisassembler -lLLVMMipsCodeGen -lLLVMMipsAsmParser -lLLVMMipsDesc -lLLVMMipsInfo -lLLVMMipsAsmPrinter -lLLVMLanaiDisassembler -lLLVMLanaiCodeGen -lLLVMLanaiAsmParser -lLLVMLanaiDesc -lLLVMLanaiInstPrinter -lLLVMLanaiInfo -lLLVMHexagonDisassembler -lLLVMHexagonCodeGen -lLLVMHexagonAsmParser -lLLVMHexagonDesc -lLLVMHexagonInfo -lLLVMBPFDisassembler -lLLVMBPFCodeGen -lLLVMBPFDesc -lLLVMBPFInfo -lLLVMBPFAsmPrinter -lLLVMARMDisassembler -lLLVMARMCodeGen -lLLVMARMAsmParser -lLLVMARMDesc -lLLVMARMInfo -lLLVMARMAsmPrinter -lLLVMAMDGPUDisassembler -lLLVMAMDGPUCodeGen -lLLVMAMDGPUAsmParser -lLLVMAMDGPUDesc -lLLVMAMDGPUInfo -lLLVMAMDGPUAsmPrinter -lLLVMAMDGPUUtils -lLLVMAArch64Disassembler -lLLVMAArch64CodeGen -lLLVMAArch64AsmParser -lLLVMAArch64Desc -lLLVMAArch64Info -lLLVMAArch64AsmPrinter -lLLVMAArch64Utils -lLLVMObjectYAML -lLLVMLibDriver -lLLVMOption -lLLVMX86Disassembler -lLLVMX86AsmParser -lLLVMX86CodeGen -lLLVMGlobalISel -lLLVMSelectionDAG -lLLVMAsmPrinter -lLLVMDebugInfoCodeView -lLLVMDebugInfoMSF -lLLVMX86Desc -lLLVMMCDisassembler -lLLVMX86Info -lLLVMX86AsmPrinter -lLLVMX86Utils -lLLVMMCJIT -lLLVMLineEditor -lLLVMInterpreter -lLLVMExecutionEngine -lLLVMRuntimeDyld -lLLVMCodeGen -lLLVMTarget -lLLVMCoroutines -lLLVMipo -lLLVMInstrumentation -lLLVMVectorize -lLLVMScalarOpts -lLLVMLinker -lLLVMIRReader -lLLVMAsmParser -lLLVMInstCombine -lLLVMTransformUtils -lLLVMBitWriter -lLLVMAnalysis -lLLVMObject -lLLVMMCParser -lLLVMMC -lLLVMBitReader -lLLVMProfileData -lLLVMCore -lLLVMSupport -lLLVMDemangle -lcurses -lz -lm

System Info

$ which python
/usr/bin/python
$ python --version
Python 2.7.10
$ uname -prsv
Darwin 16.7.0 Darwin Kernel Version 16.7.0: Mon Nov 13 21:56:25 PST 2017; root:xnu-3789.72.11~1/RELEASE_X86_64 i386

Passing double to kernel by value fails

Passing a double value to a kernel by value like this:

__global__ void foo(double* A, double x)
{
  int  i = threadIdx.x;
  A[i] = x;
}

Results in the following error when compiling with cocl:

error: invalid forward reference to function 'setKernelArgFloat' with wrong type!
  call void @setKernelArgFloat(double %loadCudaArg1)
            ^
1 error generated.
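
A hedged workaround until double-by-value kernel arguments are supported: pass the scalar through a one-element device buffer instead (or downcast to float, if the precision loss is acceptable), e.g.:

__global__ void foo(double *A, const double *x_buf)
{
  int i = threadIdx.x;
  A[i] = x_buf[0];   // read the scalar from a one-element buffer rather than by value
}

void launch_foo(double *d_A, double x, int n)
{
  double *d_x;
  cudaMalloc((void **)&d_x, sizeof(double));
  cudaMemcpy(d_x, &x, sizeof(double), cudaMemcpyHostToDevice);
  foo<<<1, n>>>(d_A, d_x);
  cudaFree(d_x);
}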

Update to llvm 4.0?

LLVM 4.0 was released on 13 March 2017.
Using an outdated LLVM seems counterproductive, especially since it isn't even LLVM 3.9, the last release of its series.

Deep learning framework support with CUDA backend

Since Coriander can support CUDA on OpenCL, are there any success stories of making it work with the CUDA backend of a deep learning framework such as TensorFlow or Caffe on a non-CUDA platform?

[...] Eigen is a pre-requisite for running Eigen.

In: coriander/test/eigen/README.md
We see the following sentence:

Eigen is used by Tensorflow, so being able to build and run Eigen is a pre-requisite for running Eigen.

Surely you mean:

Eigen is used by Tensorflow, so being able to build and run Eigen is a pre-requisite for running Tensorflow.

If you want me to make a PR for it, ask me. :)

bad_alloc when launch cuda kernel, [PowerVR Rogue G6230]

When running cuda_sample or other test applications, I get the error below, which happens at the time of launching the CUDA kernel (i.e. setValue in cuda_sample).

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Stack dump:
0.	<eof> parser at end of file

There is no useful info in the backtrace:
https://gist.github.com/JammyZhou/23782a6764ea2879dd179fff37fb64c0

After enabling COCL_SPAM, the log is below:

OpenCL platform: PowerVR Rogue
OpenCL device: PowerVR Rogue G6230
setKernelArgInt32 2
setKernelArgFloat 123
kernelGo() kernel: _Z8setValuePfif
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Stack dump:
0.	<eof> parser at end of file

Arch support

This is mainly to track what still has to be done to install this on Arch Linux.

  • Arch users don't have the libtinfo5 package installed, yet it is required.
    Installing https://aur.archlinux.org/packages/libtinfo5/ doesn't work.
  • The llvm_version gathered from ~/coriander/soft/llvm-4.0/bin/clang++ --version is incorrect ('information' instead of 4.0.0). lol

~/coriander/soft/llvm-4.0/bin/clang++ --version results in:

/usr/lib/libtinfo.so.5: no version information available (required by /home/iame/coriander/soft/llvm-4.0/bin/clang++)
clang version 4.0.0 (tags/RELEASE_400/final)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/iame/coriander/soft/llvm-4.0/bin

I'll supply you with the required PRs to make it work on Arch Linux :)

Build failure for aarch64

When I tried to build Coriander on my aarch64 board, I ran into the error below; this conditional-branch issue seems common for some other projects on aarch64 as well.

[ 2%] Built target clew
[ 2%] Building CXX object CMakeFiles/clblast.dir/src/CLBlast/src/database/database.cpp.o

/tmp/ccitmCEV.s: Assembler messages:
/tmp/ccitmCEV.s:22989: Error: conditional branch out of range
/tmp/ccitmCEV.s:22996: Error: conditional branch out of range
CMakeFiles/clblast.dir/build.make:54: recipe for target 'CMakeFiles/clblast.dir/src/CLBlast/src/database/database.cpp.o' failed
make[2]: *** [CMakeFiles/clblast.dir/src/CLBlast/src/database/database.cpp.o] Error 1
CMakeFiles/Makefile2:1980: recipe for target 'CMakeFiles/clblast.dir/all' failed
make[1]: *** [CMakeFiles/clblast.dir/all] Error 2
Makefile:117: recipe for target 'all' failed
make: *** [all] Error 2

Permission denied (public key)

I don't know if it was a problem with ssh (the internet I'm using has had problems) or something else, but I can't download the CLBlast submodule using the Python install.
I've tried to manually download your CLBlast into the /src folder, but it doesn't seem to notice it.

I am running Ubuntu 16.04

The exact errors are:

/usr/bin/python /home/alex/coriander/bin/cocl_plugins.py install --repo-url https://github.com/hughperkins/coriander-clblast --git-branch master
Cloning into 'coriander-clblast'...
Cloning into 'src/CLBlast'...
Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
fatal: clone of '[email protected]:hughperkins/CLBlast' into submodule path 'src/CLBlast' failed
('CLANG_HOME', '/home/alex/coriander/soft/llvm-4.0')
('COCL_INSTALL_PREFIX', '/home/alex/coriander')
('COCL_INCLUDE', '/home/alex/coriander/include')
('COCL_LIB', '/home/alex/coriander/lib')
plugin_name [coriander-clblast]

Traceback (most recent call last):
File "/home/alex/coriander/bin/cocl_plugins.py", line 114, in
func(**args_dict)
File "/home/alex/coriander/bin/cocl_plugins.py", line 80, in install
], cwd=git_dir))
File "/home/alex/coriander/bin/cocl_plugins.py", line 26, in check_output
res = subprocess.check_output(cmdlist, cwd=cwd)
File "/usr/lib/python2.7/subprocess.py", line 574, in check_output
raise CalledProcessError(retcode, cmd, output=output)
subprocess.CalledProcessError: Command '['git', 'clone', '--recursive', 'https://github.com/hughperkins/coriander-clblast', '-b', 'master', 'coriander-clblast']' returned non-zero exit status 128
p.returncode 1
Traceback (most recent call last):
File "install_distro.py", line 221, in
main(**args.dict)
File "install_distro.py", line 212, in main
install_plugin(install_dir=install_dir, repo_url=repo_url, git_branch=git_branch)
File "install_distro.py", line 199, in install_plugin
'--git-branch', git_branch
File "install_distro.py", line 88, in run
assert p.returncode == 0
AssertionError

Error on python build script

When running python install_distro.py on macOS Sierra 10.12.5 I get:

cd to [/Users/laptop/coriander]
cd to [/Users/laptop/coriander/soft]
wget --progress=dot:giga http://releases.llvm.org/4.0.0/clang+llvm-4.0.0-x86_64-apple-darwin.tar.xz -O clang+llvm-4.0.0-x86_64-apple-darwin.tar.xz
Traceback (most recent call last):
File "install_distro.py", line 193, in
main(**args.dict)
File "install_distro.py", line 177, in main
maybe_install_llvm(install_dir=install_dir)
File "install_distro.py", line 138, in maybe_install_llvm
install_llvm(install_dir)
File "install_distro.py", line 113, in install_llvm
run(['wget', '--progress=dot:giga', target_url, '-O', filename])
File "install_distro.py", line 51, in run
p = subprocess.Popen(cmdlist, cwd=current_dir, stdout=f_out, stderr=subprocess.STDOUT, bufsize=1)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 710, in init
errread, errwrite)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1335, in _execute_child
raise child_exception
OSError: [Errno 2] No such file or directory
