libocca / occa
Portable and vendor-neutral framework for parallel programming on heterogeneous platforms.
Home Page: https://libocca.org
License: MIT License
Being able to retrieve the addresses on the device side for an array of them would be useful.
Would it be possible to allow a return statement (return 0 on success, otherwise failure)?
Pass profile: true to the kernel properties to aggregate timing information.
I am unable to get my program to work with streams and multiple devices in mode=CUDA. So, as a test, I modified the addVectors example under examples/usingStreams/ to include an extra device. The error is CUDA_ERROR_INVALID_HANDLE. See a copy of the modified code below.
#include <iostream>

#include "occa.hpp"

int main(int argc, char **argv) {
  int entries = 8;

  float *a  = new float[entries];
  float *b  = new float[entries];
  float *ab = new float[entries];

  for (int i = 0; i < entries; ++i) {
    a[i]  = i;
    b[i]  = 1 - i;
    ab[i] = 0;
  }

  occa::device device;
  occa::device device1; // ADDED
  occa::kernel addVectors;
  occa::memory o_a, o_b, o_ab;
  occa::stream streamA, streamB;

  device.setup("mode = CUDA, deviceID = 0");  // UNCOMMENTED
  device1.setup("mode = CUDA, deviceID = 1"); // ADDED
  // device.setup("mode = OpenCL, platformID = 0, deviceID = 1"); // COMMENTED

  streamA = device.getStream();
  streamB = device.createStream();

  o_a  = device.malloc(entries*sizeof(float));
  o_b  = device.malloc(entries*sizeof(float));
  o_ab = device.malloc(entries*sizeof(float));

  addVectors = device.buildKernelFromSource("addVectors.okl",
                                            "addVectors");

  o_a.copyFrom(a);
  o_b.copyFrom(b);

  device.setStream(streamA);
  addVectors(entries, o_a, o_b, o_ab);

  device.setStream(streamB);
  addVectors(entries, o_a, o_b, o_ab);

  o_ab.copyTo(ab);

  for (int i = 0; i < entries; ++i)
    std::cout << i << ": " << ab[i] << '\n';

  delete [] a;
  delete [] b;
  delete [] ab;

  addVectors.free();
  o_a.free();
  o_b.free();
  o_ab.free();
  device.free();
  device1.free(); // ADDED
}
The wrapMemory() function in CUDA.cpp causes a runtime error. The error occurs when finish() is called, but when I modified the code (see line 1300 in CUDA.cpp), I was able to get the kernel to execute successfully. Below is my suggested patch/replacement for the wrapMemory() function.
template <>
memory_v* device_t<CUDA>::wrapMemory(void *handle_,
                                     const uintptr_t bytes) {
  memory_v *mem = new memory_t<CUDA>;

  mem->dHandle = this;
  mem->size    = bytes;
  mem->handle  = (CUdeviceptr*) handle_;

  mem->memInfo |= memFlag::isAWrapper;

  return mem;
}
Similar to https://github.com/ekondis/mixbench: measure throughput and bandwidth for a GPU.
Hi David,
I would like to use vector intrinsics when running on the CPU in OpenMP mode. For this purpose, I need to change all occaDeviceMalloc calls to the corresponding __aligned_malloc, e.g. to ensure 32-byte alignment for vdouble4. Unfortunately, I could not find an occaDeviceAlignedMalloc in OCCA.
The vector.hpp file has some vector intrinsics towards the bottom of the file, but it is not complete yet, so I wrote a new class, which is not working properly right now. I am guessing the reason may be byte-alignment issues.
Daniel
Add coveralls coverage
I am having trouble running on the Xeon Phi using the offload model. The simple program shown below prints 240 cores, but bin/occainfo does not detect the Xeon Phi (only CPU info is shown). I also tried to compile occa with OCCA_COI_ENABLED=1, but I learned this is deprecated, so I am trying to run in OpenMP mode.
#include <stdio.h>
#include <omp.h>

int main() {
  int nprocs;
#pragma offload target(mic)
  nprocs = omp_get_num_procs();
  printf("Hello %d\n", nprocs);
  return 0;
}
With OpenMP mode I am redefining the outer loop of my kernels to run in parallel and on the MIC with:
#define myOuterFor OCCA_PRAGMA("offload target(mic)") occaParallelFor occaOuterFor0
I got errors after I added the mic pragma:
error: pointer variable "occaKernelArgs" in this offload region must be specified in an in/out/inout/nocopy clause
Is it possible to run OCCA in native mode? I would like to launch more than one process per MIC card with a few threads each, instead of running threads only. But I guess this would be against the design of OCCA.
Daniel
Reduction is not trivial using OKL; instead, we can make it a backend requirement to provide a way to do fast reductions.
@reduction void dot(const int entries,
                    const float *a,
                    const float *b,
                    @reduce float out) {
  for (int i = 0; i < entries; ++i; reduce(32)) {
    out += a[i] * b[i];
  }
}
We can have multiple reductions in one kernel
@reduction void dot_norm(const int entries,
                         const float *a,
                         const float *b,
                         @reduce float dot,
                         @reduce float a_norm2,
                         @reduce float b_norm2) {
  for (int i = 0; i < entries; ++i; reduce(32)) {
    dot     += a[i] * b[i];
    a_norm2 += a[i] * a[i];
    b_norm2 += b[i] * b[i];
  }
}
I am in need of atomicAdd for floats in OCCA. Currently it seems only the CUDA atomicAdd on floats works. atomicAdd on doubles doesn't work in CUDA, and OpenCL does not seem to define it on floats/doubles at all. I know CUDA doesn't implement atomicAdd on 64 bits directly but via atomicCAS; still, it would be great to have it.
My OpenCL is:
Version: OpenCL 1.2 AMD-APP (1445.5)
Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_spir cl_amd_svm cl_khr_gl_event
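For CUDA devices without a native 64-bit float atomicAdd, the usual workaround (adapted from the CUDA C Programming Guide) emulates it with a compare-and-swap loop; an OpenCL analogue would need cl_khr_int64_base_atomics, which the device above does advertise. A sketch:

```cuda
// atomicAdd for double via atomicCAS: standard workaround from the
// CUDA C Programming Guide for devices lacking the native intrinsic
__device__ double atomicAddDouble(double *address, double val) {
  unsigned long long int *address_as_ull = (unsigned long long int *) address;
  unsigned long long int old = *address_as_ull, assumed;
  do {
    assumed = old;
    // reinterpret bits, add, and swap only if nobody changed the value
    old = atomicCAS(address_as_ull, assumed,
                    __double_as_longlong(val + __longlong_as_double(assumed)));
  } while (assumed != old);  // retry if another thread updated the slot
  return __longlong_as_double(old);
}
```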
occaType to avoid allocations
Add some common vector operations and linear algebra routines to occa::array: + - * / operations
Since we build the AST from parsed OKL code, have a feature to emulate the AST to search for
Test out Serial and OpenMP modes with ARM hardware to make sure the compiler and compiler flags work. Find ARM-specific macros for vectorization options (float2, float4, ...).
The parser is silently ignoring the '[i]' in such a case.
The definition of occaMax at line 209 in occa/defines/OpenCL.h has "<"; it needs ">".
We would need to:
metal mode and backend
FYI: I can allocate OCCA device memory inside the occaKernel routine body (outside of the actual kernel loops), which allows me to define temporary global device memory buffers there instead of externally, where they must then be passed in as parameters. This means I can keep the logic for allocating and managing temporary/work buffers used in kernels localized to the occaKernel routine code where it belongs.
I achieve this by passing in the occaDevice pointer cast as a uint64_t, then calling a special-purpose, externally defined C++ allocation function ("CreateOccaMemory()") inside the routine body to return an occaPointer to a buffer (global device memory). This allocation function simply takes the incoming occaDevice pointer plus a number-of-elements argument and calls its OCCA malloc() function. The return type is (for example) occaPointer float*. I also have a corresponding memory free function ("ReleaseOccaMemory()").
Anyway, it would be much better (and helpful for all OCCA programmers) if there were built-in support for this without having to explicitly pass in the occaDevice pointer as a parameter from the outside. That pointer could be a hidden parameter generated and passed in by the OCCA build. In terms of user syntax, the user could simply define global device memory via dynamically sized arrays (auto-freed on return). Or, similar to what I do, you could just have a new built-in OCCA allocator function (e.g., "occaMalloc()") that users call to create global device memory, which is then accessible in subsequent kernel loops. Of course, you would also want to provide an "occaFree()" function.
Update documentation in libocca.org
In https://github.com/libocca/occa/blob/master/include/occa/defines/OpenMP.hpp#L441 I think the operators need to be overloaded for private-to-private operations, such as
friend inline TM operator + (const occaPrivate_t &a, const occaPrivate_t &b)
or something like this (my C++ is a little rusty).
The README indicates that OCCA_CXX is the right way to control the C++ compiler used to build OCCA. This is false in practice, both on Linux and MacOS. You can see below that I want to use g++-7, but OCCA chooses g++ instead. On MacOS, it defaults to c++ and ignores my attempts to use clang++ and g++-7.
jrhammon@klondike:/tmp$ git clone https://github.com/libocca/occa.git
Cloning into 'occa'...
remote: Counting objects: 18556, done.
remote: Total 18556 (delta 0), reused 0 (delta 0), pack-reused 18556
Receiving objects: 100% (18556/18556), 12.84 MiB | 9.35 MiB/s, done.
Resolving deltas: 100% (14266/14266), done.
Checking connectivity... done.
jrhammon@klondike:/tmp$ cd occa
jrhammon@klondike:/tmp/occa$ export OCCA_CXX=g++-7
jrhammon@klondike:/tmp/occa$ make
mkdir -p /tmp/occa//obj
mkdir -p /tmp/occa//obj/parser
mkdir -p /tmp/occa//obj/python
g++ -O3 -fPIC -DOCCA_COMPILED_DIR="/tmp/occa/" -o /tmp/occa//obj/Serial.o -fopenmp -DOCCA_OPENMP_ENABLED=1 -O3 -D __extern_always_inline=inline -DOCCA_DEBUG_ENABLED=0 -DNDEBUG=1 -DOCCA_SHOW_WARNINGS=0 -DOCCA_CHECK_ENABLED=1 -DOCCA_OPENCL_ENABLED=1 -DOCCA_CUDA_ENABLED=1 -D LINUX_OS=1 -D OSX_OS=2 -D WINDOWS_OS=4 -D WINUX_OS=5 -D OCCA_OS=LINUX_OS -c -I/tmp/occa//lib -I/tmp/occa//include -I/usr/local/cuda-9.0/include -I/usr/local/cuda-9.0/include /tmp/occa//src/Serial.cpp
g++ -O3 -fPIC -DOCCA_COMPILED_DIR="/tmp/occa/" -o /tmp/occa//obj/OpenCL.o -fopenmp -DOCCA_OPENMP_ENABLED=1 -O3 -D __extern_always_inline=inline -DOCCA_DEBUG_ENABLED=0 -DNDEBUG=1 -DOCCA_SHOW_WARNINGS=0 -DOCCA_CHECK_ENABLED=1 -DOCCA_OPENCL_ENABLED=1 -DOCCA_CUDA_ENABLED=1 -D LINUX_OS=1 -D OSX_OS=2 -D WINDOWS_OS=4 -D WINUX_OS=5 -D OCCA_OS=LINUX_OS -c -I/tmp/occa//lib -I/tmp/occa//include -I/usr/local/cuda-9.0/include -I/usr/local/cuda-9.0/include /tmp/occa//src/OpenCL.cpp
^Cmakefile:68: recipe for target '/tmp/occa//obj/OpenCL.o' failed
make: *** [/tmp/occa//obj/OpenCL.o] Interrupt
A number of the math functions are already wrapped in occa, for instance occasqrt.
Would it be possible to wrap the rest of the OpenCL-supported math functions?
https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/mathFunctions.html
I particularly need hypot, but figure it could be worth just wrapping them all (or at least the OpenCL/CUDA intersection).
Hi, I use branch 1.0. When I test the examples I get errors.
In midgTest, it says could not find function [midg] in file midg.okl.
In mandelbulb, it says no setupAide.hpp.
Thank you.
I am only interested in OpenCL support. CUDA support is enabled by default even though it is not functional [1] on my machine. The build succeeds but testing fails.
I cannot find a build-system option to disable CUDA, short of hacking makefiles by hand.
jrhammon@klondike:/tmp$ git clone https://github.com/libocca/occa.git
Cloning into 'occa'...
remote: Counting objects: 18556, done.
remote: Total 18556 (delta 0), reused 0 (delta 0), pack-reused 18556
Receiving objects: 100% (18556/18556), 12.84 MiB | 9.78 MiB/s, done.
Resolving deltas: 99% (14184/14266)
Resolving deltas: 100% (14266/14266), done.
Checking connectivity... done.
jrhammon@klondike:/tmp$ cd occa
jrhammon@klondike:/tmp/occa$ export OCCA_CXX=g++-7
jrhammon@klondike:/tmp/occa$ export LD_LIBRARY_PATH=$PWD/lib:$LD_LIBRARY_PATH
jrhammon@klondike:/tmp/occa$ make -j16 >& make.log
jrhammon@klondike:/tmp/occa$ make test
echo '---[ Testing ]--------------------------'
---[ Testing ]--------------------------
cd /tmp/occa/; \
make -j 4 CXXFLAGS='-g' FCFLAGS='-g'
make[1]: Entering directory '/tmp/occa'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/tmp/occa'
cd /tmp/occa//examples/addVectors/cpp; \
make -j 4 CXXFLAGS='-g' FCFLAGS='-g'; \
./main
make[1]: Entering directory '/tmp/occa/examples/addVectors/cpp'
g++ -g -o /tmp/occa/examples/addVectors/cpp//main -fopenmp -DOCCA_OPENMP_ENABLED=1 -O3 -D __extern_always_inline=inline -DOCCA_DEBUG_ENABLED=0 -DNDEBUG=1 -DOCCA_SHOW_WARNINGS=0 -DOCCA_CHECK_ENABLED=1 -DOCCA_OPENCL_ENABLED=1 -DOCCA_CUDA_ENABLED=1 -D LINUX_OS=1 -D OSX_OS=2 -D WINDOWS_OS=4 -D WINUX_OS=5 -D OCCA_OS=LINUX_OS /tmp/occa/examples/addVectors/cpp//main.cpp -I/tmp/occa/lib -I/tmp/occa/include -L/tmp/occa/lib -I/usr/local/cuda-9.0/include -I/usr/local/cuda-9.0/include -locca -lm -lrt -ldl -L/usr/local/cuda-9.0/lib64 -lOpenCL -L/usr/local/cuda-9.0/lib64/stubs -lcuda
In file included from /tmp/occa/include/occa.hpp:18:0,
from /tmp/occa/examples/addVectors/cpp//main.cpp:3:
/tmp/occa/include/occa/defines/vector.hpp:20:32: warning: unknown option after ‘#pragma GCC diagnostic’ kind [-Wpragmas]
#pragma GCC diagnostic ignored "-Wint-in-bool-context"
^
make[1]: Leaving directory '/tmp/occa/examples/addVectors/cpp'
./main: error while loading shared libraries: libcuda.so.1: cannot open shared object file: No such file or directory
makefile:116: recipe for target 'test' failed
[1] /usr/local/cuda-9.0 exists, but the requisite drivers are not installed and thus CUDA does not work.
As far as I can tell, there is no mechanism in occa that allows me to query the main memory of a device.
Two ways that I could see doing this are
Would either of these be feasible to implement within occa? We were having problems locking up our boxes when allocating too much memory on the device, so we want to put in some checks to make sure this doesn't occur in our codes.
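Both major backends already expose the needed query at the driver level, so an occa::device wrapper would mostly be plumbing. A sketch of what the CUDA mode could call (OpenCL has the analogous clGetDeviceInfo with CL_DEVICE_GLOBAL_MEM_SIZE); this is an illustration of the underlying APIs, not occa code:

```cuda
#include <cstdio>
#include <cuda.h>

// Query free and total memory on device 0 via the CUDA driver API;
// a backend wrapper could surface these numbers through occa::device
int main() {
  cuInit(0);

  CUdevice dev;
  CUcontext ctx;
  cuDeviceGet(&dev, 0);
  cuCtxCreate(&ctx, 0, dev);

  size_t freeBytes = 0, totalBytes = 0;
  cuMemGetInfo(&freeBytes, &totalBytes);
  printf("free: %zu, total: %zu\n", freeBytes, totalBytes);

  cuCtxDestroy(ctx);
  return 0;
}
```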
Formalize a way to modify kernel AST with custom attributes
We're aiming for a production-grade release with 1.0. A large amount of refactoring is being done for the API and OKL parser.
Issues
occa::array
We're releasing documentation with
This has been a long and crucial part missing in our release process. We are going to be making testing a required part in our release process. Coverage metrics will also be included.
Helper methods were added to facilitate collaboration and extending OCCA in other codebases.
Examples include:
occa://occa/foo.okl)
We're aiming to unify backends through a general API. However, we would also like to expose various features unique to different backends.
For example:
deviceID in OpenCL and CUDA, etc.
We added properties in 1.0 to help create a more flexible API. Properties use the standard JSON representation to maintain a 1-1 mapping from data to string.
The following JS augmentations are also supported:
{ mode: 'Serial', }
While we support loading JSON strings with occa::json, properties are always JSON objects. Hence, we also support omitting the braces when initializing properties from a string.
occa::properties(mode: 'Serial')
OCCA modes are composed of 3 classes: device, kernel, and memory. Modes can now be added outside of OCCA through registering the modes. The mode constructor will register the classes with its respective mode.
occa::mode<openmp::modeInfo,
           openmp::device,
           openmp::kernel,
           openmp::memory> mode("OpenMP");
The OKL kernel language is one of the key features in OCCA. Adding JIT compilation of kernels gives us the flexibility of reusing kernels for different OCCA modes. We are refactoring the parser from the bottom up for robustness and flexibility.
Custom C++ classes can now be passed as kernel arguments by adding the occa::kernelArg cast operator:
operator occa::kernelArg ();
OCCA does not restrict the number of arguments or types a class can add to the kernel call. The backend may restrict them; for example, 256 is the maximum number of arguments in CPU modes.
To see an example of this use, check occa::array.
Note: The OKL kernel has to match the same number of arguments and types.
We're adding some common vector operations and linear algebra routines to occa::array: + - * / operations.
We will officially support the following frontends in the 1.0 release
Language | Github Repository |
---|---|
C | libocca/occa |
C++ | libocca/occa |
Python | libocca/occa-python |
We'll be adding the following in the 1.1 release
Language | Github Repository |
---|---|
Python | libocca/occa-python |
Fortran | libocca/occa-fortran |
Potential future languages
Language | Github Repository |
---|---|
Java | libocca/occa-java |
Julia | libocca/occa-julia |
Documentation and testing will be included
Is there a reason why that is? I added it manually in my code, but I was wondering if this is a bug or a feature in OCCA2.
Found in laghos/kernels/cpu/quadratureData.okl when (cNorm > 1e-16) and (maxNorm > 1e-16), lines 370 and 377.
Regarding 96d9460:
I'm wondering about the right path forward here. Sure, I can call loopy myself using whatever syscall, and that's essentially equivalent to the code that was removed. But since loopy can take a couple seconds to run, the right thing to do would be to cache its output. I was rather hoping to piggyback off of OCCA's compiler cache to do so, which would point in the direction of tighter integration. So what I'm looking for here is a pronouncement by either @tcew or @dmed256 along one of the following lines:
cc @lcw
We're adding documentation in readthedocs: http://occa.readthedocs.io
Once the draft is done, we'll review it one more time before the 1.0 release.
Try out the following to see if we can add __constant support to native CUDA kernels.
Manually put the constant variable/array with a name in the kernel file
__constant float myVar;
__constant float myArray[10];
Compile kernel
occa::kernel foo = device.buildKernel("foo.cu", "foo");
Use an OCCA helper method to initialize the constant value. The 'extra work' part is keeping track of the variable name and size outside of the kernel:
float myVar = 10;
float *myArray = new float[10];
// Initialize myArray

occa::cuda::initConstantMemory(foo, "myVar"  , &myVar , sizeof(float));
occa::cuda::initConstantMemory(foo, "myArray", myArray, 10 * sizeof(float));
Hi,
I am having problems including a file with a device function. It looks like the device function gets multiply defined. For example, I have a file that is included with occaKernelInfoAddInclude
that contains
void conn_mapping(const int tree, const dfloat a, const dfloat b,
                  const dfloat c, dfloat *x, dfloat *y, float *z)
{
  *x = a;
  *y = b;
  *z = c;
}
When this file gets included in the kernel info, compiling kernels fails like this:
Compiling [compute_X]
g++ -x c++ -fPIC -shared /Users/lucas/._occa/kernels/a21338d5e3be2cbf/source.occa -o /Users/lucas/._occa/kernels/a21338d5e3be2cbf/binary -I/Users/lucas/research/code/ape/occa//include -L/Users/lucas/research/code/ape/occa//lib -locca
Compiling [compute_X0]
g++ -x c++ -fPIC -shared /Users/lucas/._occa/kernels/3854aa958acba9fd/source.occa -o /Users/lucas/._occa/kernels/3854aa958acba9fd/binary -I/Users/lucas/research/code/ape/occa//include -L/Users/lucas/research/code/ape/occa//lib -locca
/Users/lucas/._occa/kernels/3854aa958acba9fd/source.occa: In function 'void conn_mapping(int, float, float, float, float*, float*, float*)':
/Users/lucas/._occa/kernels/3854aa958acba9fd/source.occa:88:19: error: redefinition of 'void conn_mapping(int, float, float, float, float*, float*, float*)'
occaFunction void conn_mapping(occaConst int tree,
^
/Users/lucas/._occa/kernels/3854aa958acba9fd/source.occa:71:19: note: 'void conn_mapping(int, float, float, float, float*, float*, float*)' previously defined here
occaFunction void conn_mapping(const int tree, const dfloat a, const dfloat b,
^
Looking at the source, it seems that a21338d5e3be2cbf/source.occa is okay and only has one definition of conn_mapping, but 3854aa958acba9fd/source.occa is not. Here is the relevant section of 3854aa958acba9fd/source.occa:
occaFunction void conn_mapping(const int tree, const dfloat a, const dfloat b,
                               const dfloat c, dfloat *occaRestrict x,
                               dfloat *occaRestrict y, float *occaRestrict z)
{
  *x = a;
  *y = b;
  *z = c;
}

occaFunction void conn_mapping(occaConst int tree,
                               occaConst float a,
                               occaConst float b,
                               occaConst float c,
                               float * occaRestrict x,
                               float * occaRestrict y,
                               float * occaRestrict z);

occaFunction void conn_mapping(occaConst int tree,
                               occaConst float a,
                               occaConst float b,
                               occaConst float c,
                               float * occaRestrict x,
                               float * occaRestrict y,
                               float * occaRestrict z) {
  *x = a;
  *y = b;
  *z = c;
}
I looked but cannot figure out why this function is getting defined twice. Any help would be appreciated.
Thanks,
Lucas
Concerning vector data types:
a) The OpenCL floatn vectors support math functions such as sin, cos, etc., but occa's support in non-OpenCL modes lacks that.
b) Support for all float-n types such as float3, float6, float9, which can be used to represent a vector, a symmetric tensor, and a tensor.
Handle most of C99
When I run the example occa/examples/addVectors/c/ in CUDA mode, I get a bunch of multiple-definition warnings between primitives.hpp and CUDA.hpp, as well as defines.hpp.
There is also an error from a file that cannot be found: occa/defines.hpp
These memory leaks create valgrind noise for users that link to libocca, even if they never call into the library.
==14201== HEAP SUMMARY:
==14201== in use at exit: 232 bytes in 7 blocks
==14201== total heap usage: 35 allocs, 28 frees, 111,478 bytes allocated
==14201==
==14201== 8 bytes in 1 blocks are definitely lost in loss record 2 of 7
==14201== at 0x4C2D52F: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==14201== by 0x58B2F2F: occa::env::registerFileOpeners() (in /home/jed/src/occa/lib/libocca.so)
==14201== by 0x58B5B9C: occa::env::initialize() (in /home/jed/src/occa/lib/libocca.so)
==14201== by 0x400F519: call_init.part.0 (in /usr/lib/ld-2.26.so)
==14201== by 0x400F625: _dl_init (in /usr/lib/ld-2.26.so)
==14201== by 0x4000F69: ??? (in /usr/lib/ld-2.26.so)
==14201==
==14201== 8 bytes in 1 blocks are definitely lost in loss record 3 of 7
==14201== at 0x4C2D52F: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==14201== by 0x58B2F4C: occa::env::registerFileOpeners() (in /home/jed/src/occa/lib/libocca.so)
==14201== by 0x58B5B9C: occa::env::initialize() (in /home/jed/src/occa/lib/libocca.so)
==14201== by 0x400F519: call_init.part.0 (in /usr/lib/ld-2.26.so)
==14201== by 0x400F625: _dl_init (in /usr/lib/ld-2.26.so)
==14201== by 0x4000F69: ??? (in /usr/lib/ld-2.26.so)
==14201==
==14201== 8 bytes in 1 blocks are definitely lost in loss record 4 of 7
==14201== at 0x4C2D52F: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==14201== by 0x58B2F69: occa::env::registerFileOpeners() (in /home/jed/src/occa/lib/libocca.so)
==14201== by 0x58B5B9C: occa::env::initialize() (in /home/jed/src/occa/lib/libocca.so)
==14201== by 0x400F519: call_init.part.0 (in /usr/lib/ld-2.26.so)
==14201== by 0x400F625: _dl_init (in /usr/lib/ld-2.26.so)
==14201== by 0x4000F69: ??? (in /usr/lib/ld-2.26.so)
I am having trouble using private float4 variables. Here is a small example of a code that does not work
#include <iostream>

#include "occa.hpp"

int main(int argc, char **argv) {
  occa::printAvailableDevices();

  int nvec    = 4;
  int entries = 2;

  const char *kernelString =
    "occaKernel void addVectors(occaKernelInfoArg,                  \n"
    "                           occaConst int occaVariable entries, \n"
    "                           occaConst occaPointer float4 * a,   \n"
    "                           occaConst occaPointer float4 * b,   \n"
    "                           occaPointer float4 * ab){           \n"
    "  occaParallelFor                                              \n"
    "  occaOuterFor0{                                               \n"
    "    occaPrivate(float4, c);                                    \n"
    "    occaInnerFor0{                                             \n"
    "      const int N = occaGlobalId0;                             \n"
    "      if(N < entries){                                         \n"
    "        c = b;                                                 \n"
    "        ab[N].x = a[N].x + c[N].x;                             \n"
    "        ab[N].y = a[N].y + c[N].y;                             \n"
    "        ab[N].z = a[N].z + c[N].z;                             \n"
    "        ab[N].w = a[N].w + c[N].w;                             \n"
    "      }                                                        \n"
    "    }                                                          \n"
    "  }                                                            \n"
    "}                                                              \n";

  float *a  = new float[nvec*entries];
  float *b  = new float[nvec*entries];
  float *ab = new float[nvec*entries];

  for (int i = 0; i < nvec*entries; ++i) {
    a[i]  = i;
    b[i]  = 1 - i;
    ab[i] = 0;
  }

  occa::device device;
  occa::kernel addVectors;
  occa::kernelInfo kInfo;
  occa::memory o_a, o_b, o_ab;

  //---[ Device setup with string flags ]-------------------
  // device.setup("mode = Serial");
  device.setup("mode = OpenMP , schedule = compact, chunk = 10");
  // device.setup("mode = OpenCL , platformID = 0, deviceID = 1");
  // device.setup("mode = CUDA , deviceID = 0");
  // device.setup("mode = Pthreads, threadCount = 4, schedule = compact, pinnedCores = [0, 0, 1, 1]");
  // device.setup("mode = COI , deviceID = 0");
  //========================================================

  o_a  = device.malloc(nvec*entries*sizeof(float));
  o_b  = device.malloc(nvec*entries*sizeof(float));
  o_ab = device.malloc(nvec*entries*sizeof(float));

  addVectors = device.buildKernelFromString(kernelString,
                                            "addVectors", kInfo,
                                            occa::usingNative);

  int dims = 1;
  int itemsPerGroup(16);
  int groups((nvec*entries + itemsPerGroup - 1)/itemsPerGroup);

  addVectors.setWorkingDims(dims, itemsPerGroup, groups);

  o_a.copyFrom(a);
  o_b.copyFrom(b);

  addVectors(entries, o_a, o_b, o_ab);

  o_ab.copyTo(ab);

  for (int i = 0; i < nvec*entries; ++i)
    std::cout << i << ": " << ab[i] << '\n';

  for (int i = 0; i < nvec*entries; ++i) {
    if (ab[i] != (a[i] + b[i]))
      throw 1;
  }

  delete [] a;
  delete [] b;
  delete [] ab;

  addVectors.free();
  o_a.free();
  o_b.free();
  o_ab.free();
  device.free();

  return 0;
}
and here are the error messages that I get
> ./main
==============o=======================o==========================================
CPU Info | Processor Name | Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz
| Cores | 6
| Memory (RAM) | 31 GB
| Clock Frequency | 1.205 GHz
| SIMD Instruction Set | SSE2
| SIMD Width | 128 bits
| L1 Cache Size (d) | 32K
| L2 Cache Size | 256K
| L3 Cache Size | 12288K
==============o=======================o==========================================
OpenCL | Device Name | Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz
| Driver Vendor | Intel
| Platform ID | 0
| Device ID | 0
| Memory | 31 GB
|-----------------------+------------------------------------------
| Device Name | Tahiti
| Driver Vendor | AMD
| Platform ID | 1
| Device ID | 0
| Memory | 2 GB
|-----------------------+------------------------------------------
| Device Name | Tahiti
| Driver Vendor | AMD
| Platform ID | 1
| Device ID | 1
| Memory | 2 GB
|-----------------------+------------------------------------------
| Device Name | Tahiti
| Driver Vendor | AMD
| Platform ID | 1
| Device ID | 2
| Memory | 2 GB
|-----------------------+------------------------------------------
| Device Name | Tahiti
| Driver Vendor | AMD
| Platform ID | 1
| Device ID | 3
| Memory | 2 GB
|-----------------------+------------------------------------------
| Device Name | Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz
| Driver Vendor | Intel
| Platform ID | 1
| Device ID | 4
| Memory | 31 GB
==============o=======================o==========================================
Compiling [addVectors]
g++ -x c++ -fPIC -shared -fopenmp /home/lucas/._occa/kernels/b65af17897881926/source.occa -o /home/lucas/._occa/kernels/b65af17897881926/binary -I/home/lucas/research/code/occa//include -L/home/lucas/research/code/occa//lib -locca
/home/lucas/._occa/kernels/b65af17897881926/source.occa: In function ‘void addVectors(const int*, int, int, int, const int&, const float4*, const float4*, float4*)’:
/home/lucas/._occa/kernels/b65af17897881926/source.occa:14:11: error: no match for ‘operator=’ (operand types are ‘occaPrivate_t<float4, 1>’ and ‘const float4*’)
c = b;
^
/home/lucas/._occa/kernels/b65af17897881926/source.occa:14:11: note: candidates are:
In file included from /home/lucas/._occa/kernels/b65af17897881926/source.occa:1:0:
/home/lucas/._occa/libraries/occa/defines/OpenMP.hpp:479:14: note: TM& occaPrivate_t<TM, SIZE>::operator=(const occaPrivate_t<TM, SIZE>&) [with TM = float4; int SIZE = 1]
inline TM& operator = (const occaPrivate_t &r) {
^
/home/lucas/._occa/libraries/occa/defines/OpenMP.hpp:479:14: note: no known conversion for argument 1 from ‘const float4*’ to ‘const occaPrivate_t<float4, 1>&’
/home/lucas/._occa/libraries/occa/defines/OpenMP.hpp:484:14: note: TM& occaPrivate_t<TM, SIZE>::operator=(const TM&) [with TM = float4; int SIZE = 1]
inline TM& operator = (const TM &t){
^
/home/lucas/._occa/libraries/occa/defines/OpenMP.hpp:484:14: note: no known conversion for argument 1 from ‘const float4*’ to ‘const float4&’
---[ Error ]--------------------------------------------
File : /home/lucas/research/code/occa/src/OpenMP.cpp
Function : buildFromSource
Line : 305
Error : Compilation error
========================================================
Am I doing something wrong? If not, I would like to put in a feature request for vector private variables.
When #include-ing files, the kernel hash doesn't properly reflect the content.
Properly detect loops that use shared memory, and insert barriers before the loop starts if needed.
Am I doing something wrong?
g++ -O3 -fPIC -o /home/hugo/occa/obj/timer.o -fopenmp -DOCCA_OPENMP_ENABLED=1 -O3 -D __extern_always_inline=inline -DOCCA_DEBUG_ENABLED=0 -DNDEBUG=1 -DOCCA_SHOW_WARNINGS=0 -DOCCA_CHECK_ENABLED=1 -DOCCA_OPENCL_ENABLED=1 -DOCCA_CUDA_ENABLED=1 -DOCCA_HSA_ENABLED=0 -DOCCA_COI_ENABLED=1 -D LINUX_OS=1 -D OSX_OS=2 -D WINDOWS_OS=4 -D WINUX_OS=5 -D OCCA_OS=LINUX_OS -c -I/home/hugo/occa/lib -I/home/hugo/occa/include -I./include -I/usr/local/cuda-7.0/include -I/usr/local/cuda-7.0/include /home/hugo/occa/src/timer.cpp
In file included from /home/hugo/occa/include/occa/timer.hpp:4:0,
from /home/hugo/occa/src/timer.cpp:1:
/home/hugo/occa/include/occa/base.hpp:1144:29: error: 'occa::device occa::coi::wrapDevice' redeclared as different kind of symbol
occa::device wrapDevice(COIENGINE coiDevice);
^
/home/hugo/occa/include/occa/base.hpp:1141:18: error: previous declaration of 'occa::device occa::coi::wrapDevice(void*)'
occa::device wrapDevice(void *coiDevicePtr);
^
/home/hugo/occa/include/occa/base.hpp:1144:29: error: 'COIENGINE' was not declared in this scope
occa::device wrapDevice(COIENGINE coiDevice);
^
/home/hugo/occa/include/occa/base.hpp:1245:41: error: 'COIENGINE' has not been declared
friend occa::device coi::wrapDevice(COIENGINE coiDevice);
^
/home/hugo/occa/include/occa/base.hpp:1245:60: error: 'occa::device occa::coi::wrapDevice(int)' should have been declared inside 'occa::coi'
friend occa::device coi::wrapDevice(COIENGINE coiDevice);
^
/home/hugo/occa/include/occa/base.hpp:1361:41: error: 'COIENGINE' has not been declared
friend occa::device coi::wrapDevice(COIENGINE coiDevice);
^
/home/hugo/occa/include/occa/base.hpp:1541:41: error: 'COIENGINE' has not been declared
friend occa::device coi::wrapDevice(COIENGINE coiDevice);
^
make: *** [/home/hugo/occa/obj/timer.o] Error 1
To work with intel compilers, I think that https://github.com/libocca/occa/blob/master/include/occa/defines/OpenMP.hpp#L437
#define occaUnroll(N) occaUnroll2(unroll N)
should be
#define occaUnroll(N) occaUnroll2(unroll(N))
I should be making a new cl_mem when wrapping and copying the value, not keeping the pointer to the cl_mem*.
Unnecessary printf statement
src/FC.cpp line 10 - printf("type = %p\n", *type);
Also, when occa is run with thousands of GPUs, the info message for cached kernels becomes too much, so I suggest an option to turn it on/off.
//std::cout << "Found cached binary of [" << compressFilename(filename) << "] in [" << compressFilename(binaryFile) << "]\n";
Regards,
Daniel
Hi everybody.
While waiting for the release of libocca 1.0, I strongly suggest adding support for CUDA/OpenCL kernel source in CUDA/OpenCL OCCA mode: this would be a great advantage when porting an existing CUDA-accelerated program to OCCA.
What I suggest is:
Thank you for your great work and support
Luca Ferraro
Hi, I am new to occa. I work on an Ubuntu 16 PC with CUDA 8.0 installed. When I try to compile occa, I get an error that the program is not compatible with libcuda.so. Thank you.
There's a problem with resource deallocation when using OCCA in CUDA native mode.
File : <stripped>/occa/src/modes/cuda/kernel.cpp
Function : free
Line : 260
Message : Kernel (addVectors) : Unloading Module
Error : CUDA Error [ 709 ]: CUDA_ERROR_CONTEXT_IS_DESTROYED
Stack :
6 occa::cuda::error(cudaError_enum, std::string const&, std::string const&, int, std::string const&)
5 occa::cuda::kernel::free()
4 occa::kernel::removeKHandleRef()
3 ./main()
2 /lib64/libc.so.6(__libc_start_main+0xf5)
1 ./main()
I attach a simple diff file to apply to examples/nativeAddVectors which should reproduce the problem.
Thank you for your precious work,
In many situations we need to pass a subset of a memory buffer to a kernel. For example, when working with many streams (to implement a pipeline that processes a big buffer in chunks with the same kernel), you need the kernel to operate on different portions of a memory buffer.
for (int i = 0; i < nchunks; ++i) {
  size_t starting_index = i * chunksize;
  // stuff like device.setStream(.......)
  // async copy stuff for buffers from base + starting_index of size chunksize
  kernel( occa::memory(bufA, starting_index), occa::memory(bufB, starting_index) );
  // async copy back stuff
}
What I suggest is to provide a simple way to create occa::memory objects from other existing occa::memory objects, using an offset: this is a "must have" in all situations where you need to process a subsection of an occa::memory buffer with a kernel, for example when using multiple streams to process a buffer in chunks (see attached example).
You could go through an operator+ or a copy constructor (or both), to let the user write something like this:
// modified source from the master branch of libocca
template <>
memory_t<CUDA>::memory_t(const memory_t<CUDA> &m,
                         const uintptr_t offset) {
  *this = m;
  handle = new CUdeviceptr;

  CUdeviceptr base = *((CUdeviceptr*) m.handle);
  *((CUdeviceptr*) handle) = base + offset;
}
For OpenCL we can use sub-buffers.
We must guarantee that memory resources will not be destroyed when the newly shifted memory object goes out of scope.