fynv / thrustrtc

A CUDA tool set for non-C++ languages that provides functionality similar to Thrust, with NVRTC at its core.

License: Other

CMake 2.54% C++ 54.61% C 5.99% Python 9.95% C# 16.66% Java 9.53% Batchfile 0.36% Shell 0.36%

cuda nvrtc thrust

thrustrtc's Introduction

ThrustRTC

The aim of the project is to provide a library of general GPU algorithms, functionally similar to Thrust, that can be used in non-C++ programming languages that have an interface with C/C++ (Python, C#, JAVA, etc.).

This project uses a new CUDA programming paradigm: NVRTC + dynamic-instantiation, as an alternative to the well-established "CUDA runtime + static compilation + templates" paradigm.

Click here to learn more about the new paradigm.

Using ThrustRTC in different languages

Using this library is quite similar to using Thrust, except that you can use it from Python, C# and JAVA, and the CUDA SDK is not required.

Thrust, C++:

#include <vector>
#include <thrust/replace.h>
#include <thrust/device_vector.h>

std::vector<int> hdata({ 1, 2, 3, 1, 2  });
thrust::device_vector<int> A(hdata);
thrust::replace(A.begin(), A.end(), 1, 99);

// A contains { 99, 2, 3, 99, 2}

ThrustRTC, in C++:

#include "TRTCContext.h"
#include "DVVector.h"
#include "replace.h"

int hdata[5] = { 1,2,3,1,2 };
DVVector A("int32_t", 5, hdata);
TRTC_Replace(A, DVInt32(1), DVInt32(99));

// A contains { 99, 2, 3, 99, 2}

ThrustRTC, in Python:

import ThrustRTC as trtc

A = trtc.device_vector_from_list([1, 2, 3, 1, 2], 'int32_t')
trtc.Replace(A, trtc.DVInt32(1), trtc.DVInt32(99))

# A contains [99, 2, 3, 99, 2]

ThrustRTC, in C#:

using ThrustRTCSharp;

DVVector A = new DVVector(new int[] { 1, 2, 3, 1, 2 });
TRTC.Replace(A, new DVInt32(1), new DVInt32(99));

// A contains { 99, 2, 3, 99, 2}

ThrustRTC, in JAVA:

import JThrustRTC.*;

DVVector vec = new DVVector(new int[] { 1, 2, 3, 1, 2 });
TRTC.Replace(vec, new DVInt32(1), new DVInt32(99));

// vec contains { 99, 2, 3, 99, 2}

A significant difference between ThrustRTC and Thrust is that ThrustRTC does not include iterators. All operations work explicitly on vector types. There are adaptive objects that can be used to map to a sub-range of a vector instead of using the whole vector.
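To make the "vectors plus sub-range views" model concrete without a GPU, here is a minimal CPU-only sketch in plain Python. `RangeView` and `replace` are hypothetical names invented for this illustration; they are not ThrustRTC's API, and the real adaptive objects may behave differently.

```python
class RangeView:
    """Hypothetical stand-in for a sub-range adaptor over a vector (CPU only)."""

    def __init__(self, data, begin, end):
        if not 0 <= begin <= end <= len(data):
            raise IndexError("sub-range out of bounds")
        self.data, self.begin, self.end = data, begin, end

    def __len__(self):
        return self.end - self.begin

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        return self.data[self.begin + i]

    def __setitem__(self, i, value):
        if not 0 <= i < len(self):
            raise IndexError(i)
        self.data[self.begin + i] = value


def replace(vec, old, new):
    """Replace every element equal to `old` with `new`, in place."""
    for i in range(len(vec)):
        if vec[i] == old:
            vec[i] = new


a = [1, 2, 3, 1, 2]
replace(RangeView(a, 2, 5), 1, 99)  # only touches elements 2..4
print(a)  # [1, 2, 3, 99, 2]
```

The algorithm takes a vector-like object rather than a begin/end iterator pair, so restricting it to part of the data means wrapping the vector, not changing the call.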

Quick Start Guide

Quick Start Guide - for Python users

Quick Start Guide - for C# users

Quick Start Guide - for JAVA users

Demos

Using ThrustRTC for histogram calculation and k-means clustering.

https://fynv.github.io/ThrustRTC/Demo.html
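As a CPU-only reference for what a histogram computation produces, the following sketch uses the sort + upper-bound + adjacent-difference approach that is common with Thrust-style libraries. It is not taken from the demo page, and `dense_histogram` is a name made up for this example; it assumes integer values in the range [0, num_bins).

```python
from bisect import bisect_right


def dense_histogram(values, num_bins):
    """Count occurrences of each integer in [0, num_bins) via sort + upper bound."""
    data = sorted(values)
    # cumulative[b] = number of elements <= b
    cumulative = [bisect_right(data, b) for b in range(num_bins)]
    # adjacent difference turns cumulative counts into per-bin counts
    return [c - p for c, p in zip(cumulative, [0] + cumulative[:-1])]


print(dense_histogram([3, 1, 0, 3, 2, 3], 4))  # [1, 1, 1, 3]
```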

License

I've decided to license this project under '"Anti 996" License'

Basically, you can use the code any way you like unless you are working for a 996 company.

thrustrtc's Issues

pip install ThrustRTC in Python 3.8

Installing ThrustRTC by pip works only with Python 3.7, but with Python 3.8 I have:

ERROR: Could not find a version that satisfies the requirement ThrustRTC (from versions: none)
ERROR: No matching distribution found for ThrustRTC

Large alloc bug, regression in 0.3.11

In this Google Colab reproduction example, the latest CURandRTC + ThrustRTC 0.3.11 causes the kernel to restart with the following message:

Jun 21, 2021, 4:16:23 PM	WARNING	terminate called after throwing an instance of 'std::bad_alloc'
Jun 21, 2021, 4:16:23 PM	WARNING	tcmalloc: large alloc 140212504444928 bytes == (nil) @ 0x7f85f414b887 0x7f85f2a41c29 0x7f85f2a42afb 0x7f85f2a42bb4 0x7f85f2a42c97 0x7f85cc3291fc 0x7f85cc33a86e 0x7f85cc326bec 0x7f85cc32818a 0x7f85cc34b621 0x7f85cc5f1dec 0x7f85cc5f0f55 0x7f85cc813df6 0x557f06d9e462 0x557f06e11fd5 0x557f06d9f30a 0x557f06e0d60e 0x557f06e0c4ae 0x557f06e0c1b3 0x557f06e0a660 0x557f06d9db59 0x557f06d9da50 0x557f06e11453 0x557f06e0c4ae 0x557f06d9f3ea 0x557f06e0d60e 0x557f06e0c4ae 0x557f06d9f3ea 0x557f06e0e32a 0x557f06e0c7ad 0x557f06d9fc9f

With ThrustRTC 0.3.10 it works.
I wonder if it is related to #14

API addition to expose sync/wait functionality

Currently, when profiling code that uses ThrustRTC, execution times are attributed in a potentially misleading way to subsequent API calls, apparently due to asynchronous execution. Would it be possible to expose some wait/sync feature in the ThrustRTC API so that one could enforce completion of launched kernels?

report wrong number of argument for launch_n as exception

It would be of great help if such things:

ThrustRTC/python/api_Context.cpp

Line 88 in 6b40fc0

    
           printf("Wrong number of arguments received. %d required, %d received.", (int)num_params, (int)size);

were reported as exceptions.
Thanks
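A rough sketch of the requested behaviour, written in pure Python rather than in the C++ binding: the same message is raised as an exception instead of being printed. `check_arg_count` is a hypothetical helper, not part of ThrustRTC.

```python
def check_arg_count(num_params, args):
    """Raise instead of printing when the argument count is wrong."""
    if len(args) != num_params:
        raise ValueError(
            "Wrong number of arguments received. "
            f"{num_params} required, {len(args)} received."
        )


try:
    check_arg_count(2, ["a"])
except ValueError as err:
    print(err)  # Wrong number of arguments received. 2 required, 1 received.
```

With an exception, a caller (or a test runner) can catch the failure and see where it came from, which a message on stdout does not allow.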

Fill() regression in 0.3.9

The following used to work up until 0.3.8 but fails with newer versions without any clear error message:

import ThrustRTC as T
v = T.device_vector("float", 10)
T.Fill(v, T.DVFloat(0.))

Here's an example on a fresh Google Colab GPU runtime:

OK on 0.3.8:

pytest: "Windows fatal exception: access violation" instead of compiler error message

The following minimal code (with missing semicolon in C code):

import ThrustRTC as trtc

def test_bug():
    loop = trtc.For(['a'], 'i', "a[i]=0")
    a = trtc.device_vector_from_list((1,), 'int64_t')
    loop.launch_n(1, [a])

produces a "Windows fatal exception: access violation" (or sometimes "page error") when run with pytest (gives a correct error message when run without pytest):

test.py Windows fatal exception: access violation

Current thread 0x000036d4 (most recent call first):
  File "...\site-packages\ThrustRTC\Context.py", line 66 in launch_n
  File "...\test.py", line 6 in test_bug
  File "...\site-packages\_pytest\python.py", line 165 in pytest_pyfunc_call
  File "...\site-packages\pluggy\callers.py", line 187 in _multicall
  File "...\site-packages\pluggy\manager.py", line 81 in <lambda>
  File "...\site-packages\pluggy\manager.py", line 87 in _hookexec
  File "...\site-packages\pluggy\hooks.py", line 289 in __call__
  File "...\site-packages\_pytest\python.py", line 1451 in runtest
  File "...\site-packages\_pytest\runner.py", line 117 in pytest_runtest_call
  File "...\site-packages\pluggy\callers.py", line 187 in _multicall
  File "...\site-packages\pluggy\manager.py", line 81 in <lambda>
  File "...\site-packages\pluggy\manager.py", line 87 in _hookexec
  File "...\site-packages\pluggy\hooks.py", line 289 in __call__
  File "...\site-packages\_pytest\runner.py", line 192 in <lambda>
  File "...\site-packages\_pytest\runner.py", line 220 in from_call
  File "...\site-packages\_pytest\runner.py", line 192 in call_runtest_hook
  File "...\site-packages\_pytest\runner.py", line 167 in call_and_report
  File "...\site-packages\_pytest\runner.py", line 87 in runtestprotocol
  File "...\site-packages\_pytest\runner.py", line 72 in pytest_runtest_protocol
  File "...\site-packages\pluggy\callers.py", line 187 in _multicall
  File "...\site-packages\pluggy\manager.py", line 81 in <lambda>
  File "...\site-packages\pluggy\manager.py", line 87 in _hookexec
  File "...\site-packages\pluggy\hooks.py", line 289 in __call__
  File "...\site-packages\_pytest\main.py", line 278 in pytest_runtestloop
  File "...\site-packages\pluggy\callers.py", line 187 in _multicall
  File "...\site-packages\pluggy\manager.py", line 81 in <lambda>
  File "...\site-packages\pluggy\manager.py", line 87 in _hookexec
  File "...\site-packages\pluggy\hooks.py", line 289 in __call__
  File "...\site-packages\_pytest\main.py", line 257 in _main
  File "...\site-packages\_pytest\main.py", line 213 in wrap_session
  File "...\site-packages\_pytest\main.py", line 250 in pytest_cmdline_main
  File "...\site-packages\pluggy\callers.py", line 187 in _multicall
  File "...\site-packages\pluggy\manager.py", line 81 in <lambda>
  File "...\site-packages\pluggy\manager.py", line 87 in _hookexec
  File "...\site-packages\pluggy\hooks.py", line 289 in __call__
  File "...\site-packages\_pytest\config\__init__.py", line 74 in main
  File "C:\Program Files\JetBrains\PyCharm 2019.1.1\helpers\pycharm\_jb_pytest_runner.py", line 45 in <module>

Process finished with exit code -1073741819 (0xC0000005)

HTH

Possibility of doing FFT and representing Complex Numbers with ThrustRTC

Very useful library !

I am already doing some computation with ThrustRTC and would like to know if we can access CUDA FFT functionality from ThrustRTC itself - with minimal extra code ?

An example CPP method using Thrust is shown here

A related question: what is the best practice for representing complex numbers in a DV? I currently store them as two separate float vectors. I do see DVTuple, but can we do a DVTupleVector?
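For reference, the split layout described above (one vector of real parts, one of imaginary parts) looks like this for an element-wise complex multiply. This is a CPU-only sketch in plain Python; `complex_multiply` is a made-up name, and the sketch does not say how ThrustRTC should represent complex values.

```python
def complex_multiply(a_re, a_im, b_re, b_im):
    """Element-wise (a_re + i*a_im) * (b_re + i*b_im) on split real/imag lists."""
    out_re = [ar * br - ai * bi for ar, ai, br, bi in zip(a_re, a_im, b_re, b_im)]
    out_im = [ar * bi + ai * br for ar, ai, br, bi in zip(a_re, a_im, b_re, b_im)]
    return out_re, out_im


# (1 + 2i) * (3 + 4i) and (0 + 1i) * (0 + 1i)
res_re, res_im = complex_multiply([1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0])
print(res_re, res_im)  # [-5.0, -1.0] [10.0, 0.0]
```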

Sequence() works on Compute Capability 6.1, but doesn't work on 8.6

Trying out the example code from the Thrust Quick start guide:

import ThrustRTC as trtc

X = trtc.device_vector('int32_t', 10)
Y = trtc.device_vector('int32_t', 10)
Z = trtc.device_vector('int32_t', 10)

# initialize X to 0,1,2,3, ....
trtc.Sequence(X)

I get the expected result on a machine with compute capability 6.1 (checked with https://gist.github.com/f0k/63a664160d016a491b2cbea15913d549), but it fails on a machine with CC=8.6 with the following error:

Traceback (most recent call last):
  File "/home/test/test.py", line 8, in <module>
    trtc.Sequence(X)
  File "/usr/local/lib/python3.9/dist-packages/ThrustRTC/Transformations.py", line 34, in Sequence
    check_i(native.n_sequence(vec.m_cptr, cptr_init, cptr_step))
  File "/usr/local/lib/python3.9/dist-packages/ThrustRTC/Native.py", line 16, in check_i
    raise SystemError("An internal error happend")
SystemError: An internal error happend

Some other examples from the Quick Start Guide work on this machine. What other information should I provide?

Thanks for help.

provide ThrustRTC.version

would be useful
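Until such an attribute exists, one possible workaround is to ask the packaging metadata for the installed version. The sketch below uses the standard-library `importlib.metadata` (Python 3.8+); `installed_version` is a helper name chosen for this example, and it assumes the package was installed under the distribution name `ThrustRTC`.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


print(installed_version("ThrustRTC"))
```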

a case of lack of type specifiers in header_of_structs.h

@fynv, we are experiencing some weird regression (no idea if related to ThrustRTC or client code) and the symptoms are that previously working code started to fail to compile with errors pointing to lack of type specifiers in the header_of_structs.h, e.g.

     struct _S_d094442c7e83bcae
     {
         typedef _S_d094442c7e83bcae CurType;
          view_vec_in;
          view_vec_out;
         template<class _T0>
         __device__ inline auto operator()(const _T0& idx)
         {
         view_vec_out[idx]=(decltype(view_vec_out)::value_t)view_vec_in[idx];
         }
     };

(note that view_vec_in and view_vec_out have no types).

I'll be working on a minimal reproducer, meanwhile thought I'd post here to check if this doesn't perhaps ring a bell ...
Thanks

ThrustRTC on Colab

Do you have any idea how to fix this error? (I want to test ThrustRTC on Google Colab.)

segfault in reduction (due to NULL returned from native.n_reduce())

in api_Reductions.cpp, the native.n_reduce may return a nullptr:

ThrustRTC/python/api_Reductions.cpp

Line 56 in 6b40fc0

return nullptr;

but it is not handled in Reductions.py:

ThrustRTC/python/ThrustRTC/Reductions.py

Line 18 in bd26f89

dvptr = native.n_reduce(vec.m_cptr, cptr_init, cptr_binary_op)

and a segfault is triggered within ValueFromDVPtr

HTH

Raise "NVRTC version too high for GPU (compute_7)" on RTX2060 when "Sort_By_Key"

My code:

Error happens in "trtc.Sort_By_Key"
Std::out:

My GPU information:
RTX 2060 12GB; CUDA version: 11.6.124; Compute Capability: (7, 5): 64
My ThrustRTC version: 0.3.19

What other information should I provide?
Thanks for your help.

Give more useful debug feedback for internal error

It would be helpful if the error message "An internal error happend", raised by check_i in ThrustRTC/Native.py, could provide more useful debug feedback.

For instance, is it an error with argument length/type? Is there an illegal operation in the kernel? Is there a typecast error in the kernel?

non-CUDA usage? (serial, multi-threaded)

Thrust offers serial and multi-threaded execution. Are there plans to support it also with ThrustRTC? Thanks

Inability to do arithmetic operations for different argument types - bug or feature?

import ThrustRTC as trtc

darr_in1 = trtc.device_vector_from_list([1 ], 'int32_t')
darr_in2 = trtc.device_vector_from_list([1 ], 'int64_t')
darr_out = trtc.device_vector('int64_t', 1)

trtc.Transform_Binary(darr_in1, darr_in2, darr_out, trtc.Plus())

This code will produce an error

header_of_structs.h(11): error: no instance of function template "Plus::operator()" matches the argument list
            argument types are: (int32_t, int64_t)
            object type is: Plus
          detected during instantiation of "auto _S_780244a76f854dd5::operator()(const _T0 &) [with _T0=size_t]" 
saxpy.cu(13): here

1 error detected in the compilation of "saxpy.cu".

Is it a bug or a feature? :)
Thanks

ThrustRTC example failed execution - Loading libnvrtc failed. cuMemAlloc() failed with Error code: 2

I have a working CUDA 11.3 environment with GTX-1660Ti, Thrust library also tested working.

I installed ThrustRTC-0.3.11 with pip and tried to execute the example code given on the github page. It fails with the above message. Any suggestions/fixes ??

About multiprocess

Hi, I have created a class that samples graph nodes; the code uses ThrustRTC and CURandRTC. When I pass it to multiprocessing.spawn(func, args=('myclass')), I get this error:
ForkingPickler(file, protocol).dump(obj)
TypeError: cannot pickle '_cffi_backend._CDataBase' object
I have written a demo at https://github.com/zsy0828/trtc_question.
If you have spare time, could you tell me how to make it work with multiprocessing? Thanks a lot.
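The traceback says a `_cffi_backend._CDataBase` object cannot be pickled, which suggests the object handed to the spawned process holds native handles. One general Python technique (not specific to ThrustRTC, and untested against the linked demo) is to drop such handles in `__getstate__` and recreate them lazily in the child process. In this sketch a `threading.Lock` stands in for the unpicklable native object, and `Sampler` is a hypothetical class.

```python
import pickle
import threading


class Sampler:
    """Holds an unpicklable handle; a lock stands in for a native GPU object."""

    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self._handle = None  # created lazily, never pickled

    @property
    def handle(self):
        if self._handle is None:
            self._handle = threading.Lock()  # stand-in for device-side state
        return self._handle

    def __getstate__(self):
        state = self.__dict__.copy()
        state["_handle"] = None  # drop the native handle before pickling
        return state


if __name__ == "__main__":
    s = Sampler(100)
    _ = s.handle  # the handle now exists in the parent process
    clone = pickle.loads(pickle.dumps(s))
    print(clone.num_nodes, clone._handle)  # 100 None
```

The clone rebuilds its handle on first access to `clone.handle`, so each process ends up owning its own native state.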

About python trtc.Functor

Hello,
I'm new to Thrust and I'm really interested in your project, but I have a question about trtc.Functor.

Can a trtc.Functor call curand_init() or rand() to generate random numbers?

I'd really appreciate a reply.
Thank you very much!

raise an exception in case of incompatible CUDA version

As detailed in #19, there is a need to raise an error in case of incompatible CUDA hardware/software so that a user knows that there is a need to upgrade/downgrade CUDA.

The problematic setups are:
A. Volta GPU (or higher) + CUDA 8.0 or lower, where compute_70 hits the upper-bound of CUDA
B. Kepler GPU (or lower) + CUDA 10.2 or higher, where compute_30 hits the lower bound of CUDA

Here's a Python snippet that might help:
https://gist.github.com/f0k/63a664160d016a491b2cbea15913d549
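The two problematic setups could be turned into a check along these lines. This is only a sketch: `check_cuda_compat` is a hypothetical function, the thresholds are copied from the list above rather than from NVIDIA documentation, and detecting the compute capability and CUDA version is left to the caller (for example via the gist).

```python
def check_cuda_compat(compute_capability, cuda_version):
    """Raise RuntimeError for the two problematic setups (A and B).

    Both arguments are (major, minor) tuples, e.g. (7, 5) and (11, 6).
    """
    cc_major = compute_capability[0]
    # Setup A: Volta or newer GPU with CUDA 8.0 or lower
    if cc_major >= 7 and cuda_version <= (8, 0):
        raise RuntimeError("GPU too new for this CUDA version: upgrade CUDA")
    # Setup B: Kepler or older GPU with CUDA 10.2 or higher
    if cc_major <= 3 and cuda_version >= (10, 2):
        raise RuntimeError("GPU too old for this CUDA version: downgrade CUDA")


check_cuda_compat((7, 5), (11, 6))  # passes silently
```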

failing with exception in case CUDA not available

Currently, even a ThrustRTC hello-world like

import ThrustRTC
a = ThrustRTC.device_vector_from_list([1, 2, 3, 1, 2], 'int32_t')

when run on a machine without CUDA, fails by abruptly halting Python:

libcuda.so not found
Cannot find CUDA driver. Exiting.

It would be great to raise an exception so that a user can catch it.

Perhaps it would even be better to report it by the import statement so one can try: import ThrustRTC (same for CURandRTC)

Thanks
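Until an exception is raised, a caller could probe for the CUDA driver library before touching ThrustRTC. The sketch below uses the standard-library `ctypes.util.find_library`. `cuda_driver_library` is a made-up helper, the library names (`cuda` on Linux/macOS, `nvcuda` on Windows) are assumptions, and the lookup may not match how ThrustRTC itself locates the driver.

```python
import ctypes.util
import sys


def cuda_driver_library():
    """Return the name/path of the CUDA driver library, or None if not found."""
    name = "nvcuda" if sys.platform == "win32" else "cuda"
    return ctypes.util.find_library(name)


if cuda_driver_library() is None:
    print("CUDA driver not found; skipping GPU code path")
else:
    print("CUDA driver found; importing ThrustRTC can be attempted")
```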

update conda package

Would be great to update the conda packages available at fyplus channel here:
https://anaconda.org/fyplus/thrustrtc
(currently only v0.2.0 is available)
Thanks

how to cite ThrustRTC? :)

Question: JCuda integration

Hello,

Thanks a lot for the library.

I wondered if you think one could integrate JCuda (https://github.com/jcuda/jcuda-main) in your java API? From my understanding, the device variables should be interchangeable between ThrustRTC and JCuda?

For context, I am interested in using the ThrustRTC Java API from Clojure.

Best regards,
David

import fails after fresh install with pip (/lib64/libstdc++.so.6: version `CXXABI_1.3.9' not found)

As reported originally here: open-atmos/PySDM#513 (comment)

We have a machine on which a fresh install of ThrustRTC using pip results in unusable package (Python 3.8.5, pip 20.3.1, ThrustRTC 0.3.10, cffi 1.14.5):

python3 -c "import ThrustRTC as trtc"

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/ajaruga/.local/lib/python3.8/site-packages/ThrustRTC/__init__.py", line 1, in <module>
    from .Native import native, check_i
  File "/home/ajaruga/.local/lib/python3.8/site-packages/ThrustRTC/Native.py", line 12, in <module>
    native = ffi.dlopen(path_thrustrtc)
OSError: cannot load library '/home/ajaruga/.local/lib/python3.8/site-packages/ThrustRTC/libPyThrustRTC.so': /lib64/libstdc++.so.6: version `CXXABI_1.3.9' not found (required by /home/ajaruga/.local/lib/python3.8/site-packages/ThrustRTC/libPyThrustRTC.so)

Some Googling suggests that it is a libstdc++ version incompatibility with the symbols used within the precompiled .so object shipped via pip, but then it should likely be reported by pip (i.e. pip should prevent installing it), right?

Any hints?
Thanks

release/packaging syncing

As of today:

Github releases page (https://github.com/fynv/ThrustRTC/releases) lists 0.3.4 as the latest release
PyPI (https://pypi.org/project/ThrustRTC/) lists 0.3.13 as the latest release
Anaconda.org, fyplus channel (https://anaconda.org/fyplus/thrustrtc/files) lists 0.3.11 as the latest release

It would be great to automate the release process somehow: likely less work for the maintainer, less confusion for users, and fewer issues with outdated packages to handle.

better error handling for 0-size vector creation

as of now, passing a zero value for vector size causes the following behaviour:

import ThrustRTC as trtc
trtc.device_vector('double', 0)
cuMemAlloc() failed with Error code: 1
Error Name: CUDA_ERROR_INVALID_VALUE
Error Description: invalid argument
Out[4]: <ThrustRTC.DVVector.DVVector at 0x7f8bca0b1d90>
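One possible behaviour would be to reject the size up front with a Python exception instead of letting cuMemAlloc() fail. The sketch below is pure Python; `validated_size` is a hypothetical helper, and whether a zero-size vector should be an error at all is a decision for the maintainer.

```python
def validated_size(size):
    """Return `size` if it is a positive integer, else raise ValueError."""
    if not isinstance(size, int) or isinstance(size, bool) or size <= 0:
        raise ValueError(f"vector size must be a positive integer, got {size!r}")
    return size


try:
    validated_size(0)
except ValueError as err:
    print(err)  # vector size must be a positive integer, got 0
```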

"SystemError: An internal error happend" When using Exclusive_Scan

Hi,

I'm getting an error when using Exclusive_Scan with python

My code:

```
def process_query_vector_degree_array(E):
    """
    Processes the query vector degree array into a form that can be used to create the cluster graph

    Creates a new vector V which stores the cumulative sum of the degree array E. This is used to find the starting index of the neighbourhood vectors for each query vector when creating the cluster graph.

    Inspired by step 2 of the G-DBSCAN algorithm.

    Uses the ThrustRTC library to perform an exclusive scan

    Args:
        E: A CuPy Array. The degree array of the query vectors. Has shape (len(X), 1)
    
    Returns:
        The processed E array. As a cupy array
    """
    E_d = trtc.DVCupyVector(E)
    V_d = trtc.DVCupyVector(cp.zeros(len(E), dtype=cp.int32))
    trtc.Exclusive_Scan(E_d, V_d)
    return cp.array(V_d.to_host())


# In my test file:
E = cp.array([3, 4, 1, 1], dtype=cp.int32)
V_trtc = gsd.process_query_vector_degree_array(E)
```

The error:

```
# My own files above (I've removed my own directories)
File "~/Documents/Thesis/./src/gs_dbscan.py", line 270, in process_query_vector_degree_array
    trtc.Exclusive_Scan(E_d, V_d)
  File "~/Documents/Thesis/gs_venv/lib/python3.8/site-packages/ThrustRTC/PrefixSums.py", line 16, in Exclusive_Scan
    check_i(native.n_exclusive_scan(vec_in.m_cptr, vec_out.m_cptr, cptr_init, cptr_binary_op))
  File "~/Documents/Thesis/gs_venv/lib/python3.8/site-packages/ThrustRTC/Native.py", line 16, in check_i
    raise SystemError("An internal error happend")
SystemError: An internal error happend
```

FYI My CUDA version is 11.3
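For comparison, this is the result an exclusive scan should produce for the test input above, computed on the CPU with the standard library only. `exclusive_scan` here is a plain-Python reference written for this note, not the ThrustRTC function.

```python
from itertools import accumulate


def exclusive_scan(values, init=0):
    """CPU reference: out[i] = init + sum(values[:i])."""
    return list(accumulate(values[:-1], initial=init)) if values else []


print(exclusive_scan([3, 4, 1, 1]))  # [0, 3, 7, 8]
```

Each output element is the starting index of the corresponding neighbourhood, which is what the docstring above describes.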