rustacuda's Issues

HIP support?

AMD's HIP library offers at least a subset of CUDA operations, many of which have similar names and APIs. How feasible would it be to add this as a feature, creating a similar RustaCUDA API for high-level programming with HIP? I'm thinking it would be a feature that adds a hip module exposing a HIP version of the API. This would make porting to / targeting both platforms easier.

Context explanation and how to use rustacuda::quick_init()?

I am trying to figure out the context declaration dynamic and some explanation in the documentation would be great.

  1. In the main example (in README.md), when compiling the Rust code, context never seems to be used, and the compiler suggests replacing it with "_context", which also works. I assume it is used by the device somewhere?

  2. rustacuda::quick_init() has no example in the documentation, so it took a while to discover that "let context = rustacuda::quick_init()?;" is how to use it.

  3. It seems that the context has to be defined as a variable in main(), but I still don't know why :/
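
For illustration, a minimal sketch of the pattern in question. It relies only on what's described above plus the fact that the context is a RAII guard: it must stay bound to a named variable (even _context) for as long as CUDA calls are made, while a bare _ pattern would drop it immediately.

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Binding to `_context` keeps the guard alive without an
    // "unused variable" warning; dropping it destroys the context.
    let _context = rustacuda::quick_init()?;
    // ... all CUDA work happens while `_context` is in scope ...
    Ok(())
}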

Thanks!!

Which types should implement Send/Sync?

I'm a bit fuzzy on how these traits behave and when it's appropriate to implement them. I think most of the types in RustaCUDA don't implement them, since most of them contain a raw pointer; but the CUDA API is thread-safe, so it should be safe to share most types between threads.

Update prelude in API documentation, old re-exports

It seems to me that the re-exports in
https://bheisler.github.io/RustaCUDA/rustacuda/prelude/index.html
are outdated.

I tried to use DeviceBox, which lives in rustacuda::memory, together with DeviceBuffer, but I couldn't. I had to add

pub use rustacuda::memory::*;

Moreover, when replacing

use rustacuda::prelude::*;

with

pub use crate::context::Context;
pub use crate::context::ContextFlags;
pub use crate::device::Device;
pub use crate::module::Module;
pub use crate::stream::Stream;
pub use crate::stream::StreamFlags;
pub use crate::CudaFlags;

I could not use DeviceBuffer any longer.

Thanks and cheers!!

How to run the example with add.cubin file?

I'm trying to run the example with the add.cubin kernel but it seems not to work.

I've tried

let module_data = CString::new(include_str!("../add.cubin"))?;
let module = Module::load_from_string(&module_data)?;

but I got

error: src/../add.cubin wasn't a utf-8 file

then I tried

let filename = CString::new("./sub.cubin")?;
let module = Module::load_from_file(&filename)?;

but I got

Error: InvalidSource
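
For reference, a hedged sketch of the file-based route (note the snippet above loads ./sub.cubin rather than add.cubin). InvalidSource from load_from_file usually means the cubin was built for a different compute capability than the device; unlike PTX, a cubin is not forward-compatible.

// Assumes add.cubin was built for this GPU's architecture, e.g.
// `nvcc --cubin --gpu-architecture=sm_75 add.cu` for a compute-7.5 card.
let filename = std::ffi::CString::new("./add.cubin")?;
let module = rustacuda::module::Module::load_from_file(&filename)?;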

Adding support for getting the UUID

Currently there is no way to get the UUID of a device. I'd like to add this feature.

It's a matter of calling cuDeviceGetUuid(), which is already part of cuda-sys. There is also a cuDeviceGetUuid_v2(), but that's a very recent addition (CUDA 11.4), so I would ignore it for now as cuda-sys is on CUDA 10.2 anyway.

I would implement Device::uuid(), similarly to the existing Device::name().

Though cuDeviceGetUuid() is only available on cuda-driver-sys 0.3, which seems to be the successor of cuda-sys 0.2. Upgrading to it seems to work with minimal changes, I can also provide a PR for that (my current version of those changes is at https://github.com/vmx/RustaCUDA/tree/uuid).
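
A hypothetical sketch of what the proposed method could look like, modeled on Device::name(); the exact cuda-driver-sys types and the internal to_result() helper are assumptions here.

// Hypothetical sketch only; field and helper names are assumptions.
pub fn uuid(&self) -> CudaResult<[u8; 16]> {
    unsafe {
        let mut cu_uuid = CUuuid { bytes: [0; 16] };
        cuDeviceGetUuid(&mut cu_uuid, self.device).to_result()?;
        // `bytes` is a [c_char; 16]; normalize to u8 on every platform.
        let mut uuid = [0u8; 16];
        for (dst, src) in uuid.iter_mut().zip(cu_uuid.bytes.iter()) {
            *dst = *src as u8;
        }
        Ok(uuid)
    }
}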

Can't find -lcudart

Hi all.

I want to contribute to this project. I just got my GPU, and I've installed the NVIDIA drivers and the CUDA toolkit.

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

I added the CUDA libraries to my environment as described in the README.

# cuda installation
export PATH=/usr/local/cuda-10.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export CUDA_LIBRARY_PATH="/usr/local/cuda-10.0/lib64"

The libraries are available here:

$ ls /usr/local/cuda
bin  extras   lib64      libnvvp  NsightCompute-1.0  nvml  README   share  targets  version.txt
doc  include  libnsight  LICENSE  nsightee_plugins   nvvm  samples  src    tools

Now when I run the example given in the README, I get:

$ cargo run
   Compiling cuda_rust_check v0.1.0 (/home/dinesh/prog/rust/cuda_rust_check)                             
warning: unused variable: `context`                                                                      
  --> src/main.rs:26:9                                                                                   
   |                                                                                                     
26 |     let context = Context::create_and_push(                                                         
   |         ^^^^^^^ help: consider using `_context` instead                                             
   |                                                                                                     
   = note: #[warn(unused_variables)] on by default                                                       
                                                                                                         
error: linking with `cc` failed: exit code: 1                                                            
  |                                                                                                      
  = note: "cc" "-Wl,--as-needed" "-Wl,-z,noexecstack" "-m64" "-L" "/home/dinesh/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib" "/home/dinesh/prog/rust/cuda_rust_check/target/debug/deps/cuda_rust_check-ff11869f86f3bb6d.13petch2da205eam.rcgu.o" "/home/dinesh/prog/rust/cuda_rust_check/target/debug/deps/cuda_rust_check-ff11869f86f3bb6d.15qy0hkw3l35pdz4.rcgu.o" "/home/dinesh/prog/rust/cuda_rust_check/target/debug/deps/cuda_rust_check-ff11869f86f3bb6d.1fxjl7nwu306h06k.rcgu.o" "/home/dinesh/prog/rust/cuda_rust_check/target/debug/deps/cuda_rust_check-ff11869f86f3bb6d.1ptig0mbtxnotkno.rcgu.o" "/home/dinesh/prog/rust/cuda_rust_check/target/debug/deps/cuda_rust_check-ff11869f86f3bb6d.2dabi26ecn60fl8i.rcgu.o" "/home/dinesh/prog/rust/cuda_rust_check/target/debug/deps/cuda_rust_check-ff11869f86f3bb6d.2n2enk10fs439jlk.rcgu.o" "/home/dinesh/prog/rust/cuda_rust_check/target/debug/deps/cuda_rust_check-ff11869f86f3bb6d.34ctdxiv3nfttgb4.rcgu.o" "/home/dinesh/prog/rust/cuda_rust_check/target/debug/deps/cuda_rust_check-ff11869f86f3bb6d.34f58v21ywaf7ohy.rcgu.o" "/home/dinesh/prog/rust/cuda_rust_check/target/debug/deps/cuda_rust_check-ff11869f86f3bb6d.3780ljs8ayc2aiqg.rcgu.o" "/home/dinesh/prog/rust/cuda_rust_check/target/debug/deps/cuda_rust_check-ff11869f86f3bb6d.3c3ircaumg4xrxhw.rcgu.o" "/home/dinesh/prog/rust/cuda_rust_check/target/debug/deps/cuda_rust_check-ff11869f86f3bb6d.3lcy9r8jbsgr04o0.rcgu.o" "/home/dinesh/prog/rust/cuda_rust_check/target/debug/deps/cuda_rust_check-ff11869f86f3bb6d.3ml5jq9nqmen3vcl.rcgu.o" "/home/dinesh/prog/rust/cuda_rust_check/target/debug/deps/cuda_rust_check-ff11869f86f3bb6d.3q8pb8nreiz32fxh.rcgu.o" "/home/dinesh/prog/rust/cuda_rust_check/target/debug/deps/cuda_rust_check-ff11869f86f3bb6d.3tp77dxjsl4d1bgg.rcgu.o" "/home/dinesh/prog/rust/cuda_rust_check/target/debug/deps/cuda_rust_check-ff11869f86f3bb6d.3y4wdr0anphrplti.rcgu.o" "/home/dinesh/prog/rust/cuda_rust_check/target/debug/deps/cuda_rust_check-ff11869f86f3bb6d.44d9st8icqi9qevr.rcgu.o" "/home/dinesh/prog/rust/cuda_rust_check/target/debug/deps/cuda_rust_check-ff11869f86f3bb6d.44qnnjyev922p73d.rcgu.o" "/home/dinesh/prog/rust/cuda_rust_check/target/debug/deps/cuda_rust_check-ff11869f86f3bb6d.45zaqay3vbv772ju.rcgu.o" "/home/dinesh/prog/rust/cuda_rust_check/target/debug/deps/cuda_rust_check-ff11869f86f3bb6d.4gbzt0pjzc91a4t9.rcgu.o" "/home/dinesh/prog/rust/cuda_rust_check/target/debug/deps/cuda_rust_check-ff11869f86f3bb6d.4hksnxj1hkidpniz.rcgu.o" "/home/dinesh/prog/rust/cuda_rust_check/target/debug/deps/cuda_rust_check-ff11869f86f3bb6d.4hymhr3qyann8n1d.rcgu.o" "/home/dinesh/prog/rust/cuda_rust_check/target/debug/deps/cuda_rust_check-ff11869f86f3bb6d.4ku37wlybl4oarhv.rcgu.o" "/home/dinesh/prog/rust/cuda_rust_check/target/debug/deps/cuda_rust_check-ff11869f86f3bb6d.524inxqvoeq5t16p.rcgu.o" "/home/dinesh/prog/rust/cuda_rust_check/target/debug/deps/cuda_rust_check-ff11869f86f3bb6d.572wp8cmmtuc6c9x.rcgu.o" "/home/dinesh/prog/rust/cuda_rust_check/target/debug/deps/cuda_rust_check-ff11869f86f3bb6d.5gt6z5qyk8o63tal.rcgu.o" "/home/dinesh/prog/rust/cuda_rust_check/target/debug/deps/cuda_rust_check-ff11869f86f3bb6d.hf5soq910o5o67p.rcgu.o" "/home/dinesh/prog/rust/cuda_rust_check/target/debug/deps/cuda_rust_check-ff11869f86f3bb6d.ipp32by2pbsinsm.rcgu.o" "/home/dinesh/prog/rust/cuda_rust_check/target/debug/deps/cuda_rust_check-ff11869f86f3bb6d.kn66fwvhog07cs9.rcgu.o" "/home/dinesh/prog/rust/cuda_rust_check/target/debug/deps/cuda_rust_check-ff11869f86f3bb6d.kpe4hfqqymqoak8.rcgu.o" 
"/home/dinesh/prog/rust/cuda_rust_check/target/debug/deps/cuda_rust_check-ff11869f86f3bb6d.wdhfe6pye6ovnz9.rcgu.o" "-o" "/home/dinesh/prog/rust/cuda_rust_check/target/debug/deps/cuda_rust_check-ff11869f86f3bb6d" "/home/dinesh/prog/rust/cuda_rust_check/target/debug/deps/cuda_rust_check-ff11869f86f3bb6d.1tut32buw17s5spe.rcgu.o" "-Wl,--gc-sections" "-pie" "-Wl,-zrelro" "-Wl,-znow" "-nodefaultlibs" "-L" "/home/dinesh/prog/rust/cuda_rust_check/target/debug/deps" "-L" "/home/dinesh/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib" "-Wl,-Bstatic" "/home/dinesh/prog/rust/cuda_rust_check/target/debug/deps/librustacuda-565ea4272efbeddd.rlib" "/home/dinesh/prog/rust/cuda_rust_check/target/debug/deps/librustacuda_core-8314ac49dc75c201.rlib" "/home/dinesh/prog/rust/cuda_rust_check/target/debug/deps/libcuda_sys-a90588f89cd8eefb.rlib" "/home/dinesh/prog/rust/cuda_rust_check/target/debug/deps/libbitflags-de5961d905541c03.rlib" "-Wl,--start-group" "/home/dinesh/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/libstd-52f862a21e09568c.rlib" "/home/dinesh/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/libpanic_unwind-b89f1a9b548bfef9.rlib" "/home/dinesh/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/liballoc_jemalloc-a7f738f7aa980965.rlib" "/home/dinesh/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/libunwind-4c9cbe622de116ab.rlib" "/home/dinesh/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/liballoc_system-48c8a8552cff934e.rlib" "/home/dinesh/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/liblibc-517830e9d56bab97.rlib" "/home/dinesh/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/liballoc-2d44723f32308cf8.rlib" "/home/dinesh/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/libcore-f2133e8b70369157.rlib" "-Wl,--end-group" "/home/dinesh/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/libcompiler_builtins-71671e8958739d51.rlib" "-Wl,-Bdynamic" "-lcuda" "-lcudart" "-lcublas" "-ldl" "-lrt" "-lpthread" "-lpthread" "-lgcc_s" "-lc" "-lm" "-lrt" "-lpthread" "-lutil" "-lutil"
  = note: /usr/bin/ld: cannot find -lcudart                                                              
          /usr/bin/ld: cannot find -lcublas                                                              
          collect2: error: ld returned 1 exit status                                                     
                                                                                                         
                                                                                                         
error: aborting due to previous error                                                                    
                                                                                                         
error: Could not compile `cuda_rust_check`.                                                              

To learn more, run the command again with --verbose.
 

How to fix this?

Enable Rust 2018

Make sure that everything compiles in Rust 2018 mode, then convert the code and all examples/rustdoc comments/tests/etc. to that mode.

Add nightly-Rust build to Travis

Forgot to do this when I was setting up CI initially. Could maybe try adding a macOS build too, but I have no idea if macOS supports CUDA at all.

Question: Problem finding lcudart and lcublas when building example Ubuntu 18.04

I've just been poking around with the example from https://bheisler.github.io/post/announcing-rustacuda/ and I keep running into some build issues with cuda-rs. I've set $CUDA_LIBRARY_PATH to the location of CUDA on my system (/usr/local/cuda/lib64), and when running cargo build I get the following error:

note: /usr/bin/ld: cannot find -lcudart
/usr/bin/ld: cannot find -lcublas
collect2: error: ld returned 1 exit status

Any help would be greatly appreciated!

Mixing cublas and rustacuda ?

Can we please have sample code that

  1. allocates some memory

  2. calls A = B * C

  3. calls some kernel on A

  4. calls sgemm D = E * A

?
I have some tensor code that runs great in CPU mode but fails in GPU mode (so the algorithm is correct). All CPU vs GPU unit tests pass -- so it seems I am running into a synchronization issue.

I am using stream.synchronize() after all kernel calls -- so it seems the remaining culprit is that my kernels run on stream A while cublas runs on stream B, and it's not clear to me how to synchronize the two.

Idea: Mimic `std::boxed::Box` API as closely as possible

Apologies, this seems to have become rather long. It seemed like such a simple idea at the start!

There are four main parts to this idea:

  1. Allow DeviceBox<[T]>, making DeviceBuffer just an alias that could be deprecated in future.
  2. Make the interface safer by using MaybeUninit for uninitialized/zeroed allocations on the device.
  3. Add an Alloc generic parameter to DeviceBox, allowing for various new types of allocation.
  4. Bonus: Add support for async allocations.

I think this proposal is entirely backwards compatible, though it does introduce some methods that are very similar to existing ones, e.g. new_uninit vs. uninitialized and new_zeroed vs. zeroed.

DeviceAllocator

// new `alloc` module

pub trait DeviceAllocator {
    type Ptr;

    fn allocate(&self, size: usize) -> CudaResult<Self::Ptr>;
    // This allows for asynchronous zeroing.
    fn allocate_zeroed(&self, size: usize) -> CudaResult<Self::Ptr>;
    fn deallocate(&self, ptr: Self::Ptr) -> CudaResult<()>;
}

// Uses `cudaMalloc`, `cudaFree`.
pub struct Global; // TODO better name?

impl DeviceAllocator for Global {
    type Ptr = DevicePointer<u8>;
    ...
}

// Other allocators might include:
// `Unified`, `HostPinned`, `Pitched`, `Async`, `MemoryPool`, etc.
pub struct DeviceBox<T, A: DeviceAllocator = Global> {
    ptr: A::Ptr,
    alloc: A,
}

impl<T, A> DeviceBox<T, A> {
    pub fn new_in(x: T, alloc: A) -> DeviceBox<T, A>;
}

MaybeUninit

impl<T> DeviceBox<T, Global> {
    ...
    // Note that these methods are safe.
    pub fn new_uninit() -> DeviceBox<MaybeUninit<T>, Global>;
    pub fn new_zeroed() -> DeviceBox<MaybeUninit<T>, Global>;

}

impl<T, A> DeviceBox<T, A> {
    ...
    pub fn new_uninit_in(alloc: A) -> DeviceBox<MaybeUninit<T>, A>;
    pub fn new_zeroed_in(alloc: A) -> DeviceBox<MaybeUninit<T>, A>;
}

impl<T, A> DeviceBox<MaybeUninit<T>, A> {
    pub unsafe fn assume_init(self) -> DeviceBox<T, A>;

    // Use this for kernel outputs, then `assume_init` after the kernel is complete.
    pub unsafe fn as_uninit_device_pointer(&mut self) -> DevicePointer<T>;
}
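
A hypothetical usage sketch of the kernel-output pattern described above (all signatures are the proposed ones, not the current API; the kernel name is illustrative):

let mut out: DeviceBox<MaybeUninit<f32>, Global> = DeviceBox::new_uninit();
unsafe {
    // Hand the uninitialized pointer to a (hypothetical) kernel...
    launch!(module.fill_one<<<1, 1, 0, stream>>>(out.as_uninit_device_pointer()))?;
}
stream.synchronize()?;
// ...and only claim the contents as initialized once the kernel is done.
let out: DeviceBox<f32, Global> = unsafe { out.assume_init() };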

DeviceBox<[T]>

impl<T> DeviceBox<[T], Global> {
    pub fn new(x: &impl AsRef<[T]>) -> DeviceBox<[T], Global>;
    pub fn new_uninit_slice() -> DeviceBox<[MaybeUninit<T>], Global>;
    pub fn new_zeroed_slice() -> DeviceBox<[MaybeUninit<T>], Global>;
}

impl<T, A> DeviceBox<[T], A> {
    pub fn new_in(x: &impl AsRef<[T]>, alloc: A) -> DeviceBox<[T], A>;
    pub fn new_uninit_slice_in(alloc: A) -> DeviceBox<[MaybeUninit<T>], A>;
    pub fn new_zeroed_slice_in(alloc: A) -> DeviceBox<[MaybeUninit<T>], A>;
}

impl<T, A> DeviceBox<[MaybeUninit<T>], A> {
    pub unsafe fn assume_init(self) -> DeviceBox<[T], A>;
    pub unsafe fn as_uninit_device_pointer(&mut self) -> DevicePointer<T>;
}

Async

// Uses `cudaMallocAsync`, `cudaFreeAsync`.
pub struct Async<'a> {
    stream: &'a Stream,
}

impl<'a> Async<'a> {
    pub fn on(stream: &'a Stream) -> Async<'a>;
}

impl<'a> DeviceAllocator for Async<'a> {
    type Ptr = DevicePointerAsync<'a, u8>;
    ...
}

pub struct DevicePointerAsync<'a, T> {
    ptr: DevicePointer<T>,
    stream: &'a Stream,
    is_allocated: Event,
}

impl<'a, T, A> DeviceBox<T, A>
where
    A: DeviceAllocator<Ptr = DevicePointerAsync<'a, T>>,
{
    // If the stream matches the async pointer, return it immediately.
    // Otherwise, block `stream` on `is_allocated` event.
    pub fn as_device_pointer_on(&mut self, stream: &Stream) -> DevicePointer<T>;
    pub unsafe fn as_device_pointer_unchecked(&mut self) -> DevicePointer<T>;
}

impl<T> DeviceBox<T, Async<'_>> {
    // Wait for `is_allocated` event.
    pub fn synchronize(self) -> DeviceBox<T, Global>;
}
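
Similarly, a hypothetical end-to-end use of the Async allocator sketched above:

// Allocate asynchronously on `stream`, then resolve to a plain
// Global-backed box once the allocation event has completed.
let boxed = DeviceBox::new_in(42.0f32, Async::on(&stream));
let boxed: DeviceBox<f32, Global> = boxed.synchronize();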

launching without launch! macro

I have a weird bug (no minimal failure case yet) and I'm trying to demystify each step.

Is there a full example somewhere of launching a kernel WITHOUT using the launch! macro?
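
Not a full example, but here is a hedged sketch of roughly what launch! expands to, assuming a kernel named add that takes a single device pointer (the names are illustrative):

use std::ffi::{c_void, CString};

// Look the kernel up by name, then launch it through Stream::launch,
// passing each argument as a type-erased pointer.
let name = CString::new("add")?;
let function = module.get_function(&name)?;
let mut ptr = buffer.as_device_ptr();
unsafe {
    stream.launch(
        &function,
        1,    // grid size
        256,  // block size
        0,    // shared memory in bytes
        &[&mut ptr as *mut _ as *mut c_void],
    )?;
}
stream.synchronize()?;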

DeviceBuffer allocation doesn't complete before kernel launch?

I've just started playing with RustaCUDA, and am relatively new to Rust, so apologies if there's something obvious I'm missing here, but it seems like DeviceBuffer allocations are happening asynchronously in my code. Here's a reduced test case (adapted from this blog post):

// main.rs
use rustacuda::launch;
use rustacuda::prelude::*;
use std::error::Error;
use std::ffi::CString;

fn main() -> Result<(), Box<dyn Error>> {
  
    let _context = rustacuda::quick_init()?;
    let module = Module::load_from_string( &CString::new(include_str!("add.ptx"))? )?;
    let stream = Stream::new(StreamFlags::NON_BLOCKING, None)?;

    let n = 100000;
    let x_host: Vec<f32> = (0..n).map(|n| n as f32).collect();
    let y_host: Vec<f32> = (0..n).map(|n| n as f32).collect();
    let mut result_host: Vec<f32> = (0..n).map(|_| 0.0).collect();
    
    let mut x_device = DeviceBuffer::from_slice(&x_host)?;
    let mut y_device = DeviceBuffer::from_slice(&y_host)?;
    let mut result_device = DeviceBuffer::from_slice(&result_host)?;

    //std::thread::sleep( std::time::Duration::from_millis(100) );

    unsafe {
        launch!(module.add<<<(14, 14, 1), 512, 0, stream>>>(
            n,
            x_device.as_device_ptr(),
            y_device.as_device_ptr(),
            result_device.as_device_ptr(),
            result_device.len()
        ))?;
    }

    stream.synchronize()?;
    result_device.copy_to(&mut result_host)?;
    
    println!("x end: {:?}", &x_host[n as usize - 5..]);
    println!("y end: {:?}", &y_host[n as usize - 5..]);
    println!("result end: {:?}", &result_host[n as usize - 5..]);
  
    Ok(())
}

The kernel just takes two vectors and pair-wise sums them into an output vector:

// add.cu
extern "C" __global__ void add(int n, float *x, float *y, float *out) {
    long blockId = blockIdx.z  *  gridDim.x*gridDim.y
                 + blockIdx.y  *  gridDim.x
                 + blockIdx.x;
    long threadsPerBlock = blockDim.x;
    long i = blockId * threadsPerBlock + threadIdx.x;

    if (i < n) {
        out[i] = x[i] + y[i];
    }
}

Version info:

rustacuda = "0.1"
rustacuda_derive = "0.1"
rustacuda_core = "0.1"

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

Graphics Card: GTX 2070 (mobile/laptop version)

When I compile and run with this command:

nvcc src/add.cu --ptx -o src/add.ptx --gpu-architecture=compute_75
env LIBRARY_PATH=/usr/local/cuda/lib64 cargo +nightly run --release

There's about a 50% chance that I get this as the output:

x end: [99995.0, 99996.0, 99997.0, 99998.0, 99999.0]
y end: [99995.0, 99996.0, 99997.0, 99998.0, 99999.0]
result end: [199990.0, 199992.0, 199994.0, 199996.0, 199998.0]

And a 50% chance that I get this:

x end: [99995.0, 99996.0, 99997.0, 99998.0, 99999.0]
y end: [99995.0, 99996.0, 99997.0, 99998.0, 99999.0]
result end: [0.0, 0.0, 0.0, 0.0, 0.0]

And as you can see in that last example, the kernel additions are not reaching the end of the vector. When it doesn't reach the end it tends to get to around 70,000 out of 100,000 elements, give or take 10,000 (it's seemingly random within that range).

But if you uncomment the std::thread::sleep line, then everything works fine 100% of the time. So it seems like there's some sort of race condition here?
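
In case it helps anyone hitting the same thing, here is a hedged workaround sketch, assuming the cause is that from_slice copies synchronously on the default stream, which a NON_BLOCKING stream is not ordered against (see the separate issue below about synchronous copies and the default stream):

use rustacuda::memory::AsyncCopyDestination;

// Keep the copies and the launch on the same stream so they are ordered
// relative to each other.
let mut x_device = unsafe { DeviceBuffer::uninitialized(x_host.len())? };
unsafe { x_device.async_copy_from(x_host.as_slice(), &stream)? };
// ... same for y/result, launch on `stream`, then stream.synchronize()?
// before copying the results back.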

Expose the rest of the Driver API

RustaCUDA currently exposes only a minimum-viable subset of the Driver API. Ultimately, it should expose all non-deprecated public functions in the API.

This is a tracking issue to keep track of which aspects of the API should be available but aren't. If you need something that isn't listed here, add a comment and raise an issue.

  • Events: #6
  • Primary Context Management: #7
  • Asynchronous memcpy: #8
  • JIT Linking: #9
  • Multi-dimensional arrays: #10
  • Texture/surface objects: #11
  • Expose the rest of the Unified memory management API: #12
  • Stream callbacks: #13

There are more features to add (including features only added after CUDA 8.0) but this should be enough for now.

Add support for Events

Build a safe, rusty wrapper for the Events module.

Don't expose the cuStreamBatchMemOp, cuStreamWaitValue32 or cuStreamWriteValue32 functions, as they're deprecated in version 10 anyway.

Can DeviceBuffer::drop be forced to always succeed?

Quoting: https://github.com/bheisler/RustaCUDA/blob/master/src/memory/device/device_buffer.rs#L132-L172

    /// Deallocating device memory can return errors from previous asynchronous work. This function
    /// destroys the given buffer and returns the error and the un-destroyed buffer on failure.
...
    pub fn drop(mut dev_buf: DeviceBuffer<T>) -> DropResult<DeviceBuffer<T>> {

The fact that drop can fail is slightly problematic, as I can't figure out how to use it with RAII. In particular, when there are no more references to a DeviceBuffer, I want it to free the GPU memory. However, if this drop can fail, I can't guarantee that the GPU memory is freed.

What is the right way to handle freeing a DeviceBuffer ? (Again, the fact that cuFree can fail is very surprising to me.)
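
One hedged pattern, assuming you want RAII by default and explicit error handling only where it matters: the implicit Drop impl still frees the memory when the buffer goes out of scope, and the explicit DeviceBuffer::drop exists so the error can be observed when you care about it.

match DeviceBuffer::drop(buf) {
    Ok(()) => {}
    Err((e, buf)) => {
        eprintln!("failed to free device buffer: {:?}", e);
        // Avoid a second free attempt by the Drop impl.
        std::mem::forget(buf);
    }
}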

HIP support for CUDA/ROCm cross-compatibility?

This may well be out of scope, but I couldn't find any prior discussion.

Nvidia lock-in is a problem, but we haven't had much choice in scientific computing; Nvidia has a great software stack, which it optimizes at the assembly level. For example, the fastest OpenCL BLAS libraries still lag significantly behind Nvidia's cuBLAS.

To compete with CUDA, AMD has shifted from OpenCL to its ROCm platform. AMD is also developing a thin "HIP" compatibility layer that compiles to either CUDA or ROCm. AMD's hipBLAS, hipSPARSE, and hipDNN all translate to the cu- or roc- equivalents, depending on hardware target. So, for example, hipBLAS would link to either cuBLAS or rocBLAS.

On the hardware side, AMD's Radeon VII now looks competitive with, e.g., Nvidia's 2080 Ti. In particular, its double-precision performance is far better (theoretical 3.5 TFLOPS vs 0.37 TFLOPS). TensorFlow (HIP version) benchmarks on AMD also look pretty good at single precision.

I wonder if the RustaCUDA community has given any thought to targeting HIP rather than CUDA directly. Would this be a more-or-less automatic source conversion (replace cuda- with hip- prefixes everywhere)? Could this give easy access to both CUDA/ROCm, without sacrificing performance? Could the Rust language somehow give us an advantage for cross-platform GPGPU support? I would be very interested to hear your thoughts.

[EDIT] Just realized that ROCm is Linux-only for now. This is a bummer. Windows support might be on the way. No ETA on Mac support.

Add support for asynchronous memcpy

Copying memory asynchronously allows the memcpy to overlap with other work, as long as that work doesn't depend on the copied data. This is important for optimal performance, so RustaCUDA should provide access to it.

Does a default stream exist?

I created two streams with the NON_BLOCKING flag, as below, to test host-to-device memory copy and kernel execution concurrency. Unfortunately, they still seem to execute in order. Is there a default stream in the system, or is this kind of concurrency not supported yet? Thanks!


 let stream_0 = Stream::new(StreamFlags::NON_BLOCKING, None)?;
 let stream_1 = Stream::new(StreamFlags::NON_BLOCKING, None)?;

Support for Vector Types

Would you consider adding built-in support for the vector types used in kernels like float4 and the other built-in vector types for char, short, int, long, longlong, float, double listed in the CUDA Documentation?

If I want to use them, I can write my own definition or import them from another library, but I cannot use them in a direct, convenient way, because the types don't implement the DeviceCopy trait that is needed for sending them over to the kernel.

My current workaround looks something like this, because of course I can't add traits to a type that isn't mine.

struct CudaFloat4(float4);
unsafe impl DeviceCopy for CudaFloat4 {}
impl Deref for CudaFloat4 {
    type Target = float4;

    fn deref(&self) -> &Self::Target {
        &self.0
    }
}

Also, when I am creating one, there's a tiny bit of extra ceremony in that I have to add the 0: float4 { } around the definition, such as:

result.push(CudaFloat4 {
    0: float4 {
        x: rng.gen_range(0.0, SPACE),
        y: rng.gen_range(0.0, SPACE),
        z: rng.gen_range(0.0, SPACE),
        w: rng.gen_range(0.01, 100.0),
    }
});
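
(As an aside, the 0: ceremony can be avoided with tuple-struct call syntax; this is the same workaround, just written as a constructor call:)

result.push(CudaFloat4(float4 {
    x: rng.gen_range(0.0, SPACE),
    y: rng.gen_range(0.0, SPACE),
    z: rng.gen_range(0.0, SPACE),
    w: rng.gen_range(0.01, 100.0),
}));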

Although I'm not an expert in CUDA, I imagine these vector types are likely to be commonly used and it would be convenient for users of the library to have these supported directly in RustaCUDA. I do have a full (working) example with my workaround as above if more is needed. Thanks!

(Please pardon my ignorance if this is implemented somewhere in RustaCUDA and I've failed to find it.)

Dropping (modules and streams) ... and ... (contexts and devices)

For modules and streams, the problem is that the DropResult type cannot be used with "?".
For example, if I add the following line

Stream::drop(stream)?;

after "stream.synchronize()?;" in the example in README.md, I get
"the trait std::error::Error is not implemented for '(rustacuda::error::CudaError, rustacuda::prelude::Stream)'".
I don't know how to implement this yet, sorry, Rust beginner.
It is easy, right?
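
A hedged sketch of one workaround: map the error tuple down to the CudaError, which does implement std::error::Error, so that ? can convert it.

// Discard the un-dropped Stream from the error tuple; CudaError alone
// converts cleanly into Box<dyn Error> via `?`.
Stream::drop(stream).map_err(|(e, _stream)| e)?;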

On the other hand, when I want to finish the program and explicitly drop the context, something tries to deallocate CUDA memory after the context has been dropped.

For example, if I add these lines:

match Context::drop(context) {
    Ok(()) => println!("Successfully destroyed"),
    Err((e, _ctx)) => {
        println!("Failed to destroy context: {:?}", e);
        // Do something with ctx
    }
}

at the end of the example in README.md, it compiles and runs, but fails at the very end:

  1. the context is dropped and "Successfully destroyed" is printed, but
  2. then something tries to deallocate CUDA memory (I assume some cleanup routine at the end of the program).

As for devices, there is no method to drop a device.
There should be one, right?

Remove `unwrap` from examples

We shouldn't be encouraging users to unwrap errors, but to use ? instead. Go through all of the examples in the Rustdoc comments and change them to use ? instead of unwrap(), like so:

# use std::error::Error;
# pub fn main() -> Result<(), Box<dyn Error>> {
// Example goes here
extern crate rustacuda;
let _ctx = rustacuda::quick_init()?;
# Ok(())
# }

Add support for CUDA arrays

CUDA supports complex strided, multidimensional arrays when performing memory transfers. I'm not really sure what they're used for or how they work, but RustaCUDA should support them.

It may also be nice to support copying to/from ndarray, if that's feasible.

See the Memory module for more.

DeviceBox for Vectors of data

Hey,

I just discovered this project and it looks really interesting!

Allocating data on the GPU device looks like this for a scalar value:

let mut factor1 = DeviceBox::new(&6.0f32)?;

How can we allocate a vector of values on the device?
For example: If we want to increment all values in a vector [1,2,3],
how can this vector be transferred to the device?
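
A minimal sketch of the sequence case, assuming DeviceBuffer (provided alongside DeviceBox) is the intended type for contiguous data:

use rustacuda::memory::{CopyDestination, DeviceBuffer};

// Copy a host slice into a new device allocation of the same length...
let mut values = DeviceBuffer::from_slice(&[1.0f32, 2.0, 3.0])?;
// ... launch a kernel that increments `values` in place ...
// ...then copy the results back to the host.
let mut host = [0.0f32; 3];
values.copy_to(&mut host)?;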

It would be great to add this information to the API documentation, in order to help future users.

Best regards
Chips

Expose Stream Callbacks

CUDA streams can enqueue a host-side callback which will be called when the stream reaches that point in the work queue. RustaCUDA does not currently support this.

Stream Management

Note that we'll have to ensure that the callback code doesn't allow panics to reach the CUDA driver.

Is 0.1.3 published?

docs.rs shows that the most recent version is 0.1.3, but I cannot find this version in this repository. Are the commits missing, or is it published from another repo I'm not aware of?

How to do device-to-device memory allocations?

I am sorry if this is in the documentation; I couldn't find it in the DeviceBuffer section.

I would like to do something similar to the following instruction in CUDA:

cudaMemcpy(d_vec2, d_vec1, sizeof(d_vec1), cudaMemcpyDeviceToDevice);

which copies d_vec1 (on device) to d_vec2 (on device too).
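
A hedged sketch, assuming CopyDestination is also implemented between device slices (mirroring cudaMemcpyDeviceToDevice):

use rustacuda::memory::{CopyDestination, DeviceBuffer, DeviceSlice};

let d_vec1 = DeviceBuffer::from_slice(&[1.0f32, 2.0, 3.0, 4.0])?;
let mut d_vec2 = unsafe { DeviceBuffer::uninitialized(d_vec1.len())? };
// View the source buffer as a device slice so the device-to-device
// impl is selected; both source and destination stay on the GPU.
let src: &DeviceSlice<f32> = &d_vec1;
d_vec2.copy_from(src)?;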

Add complete support for Unified Memory

Right now, RustaCUDA only supports very basic usage of Unified Memory, but CUDA provides a complex API for pre-fetching data to a particular device, advising the driver about which device will use a range of data, and so on. RustaCUDA should expose this section of the API.

Unified Addressing

example passing float as an argument

I have a kernel:

extern "C"
__global__ void add_10(
    int n,
    float *dst, int dst_inc,
    float *a, int a_inc,
    float f ) {
  int i_start = blockIdx.x * blockDim.x + threadIdx.x;
  int i_inc = blockDim.x * gridDim.x;
  int i = i_start;

  while (i < n) {
dst[i] = a[i]*2 + 100.0 + f;
    i += i_inc;
  }
}

I am trying to call it via:

let result = unsafe {
    launch!(module.add_10<<<1, 1024, 0, stream>>>(
        4, c.device_ptr(), 1, a.device_ptr(), 1, 20.0
    ))
};

    let t = c.get();
    println!("status: {:?}, output: {:?}", result, t);

(I'm using my own "DeviceVector", not the builtin DeviceBuffer).

It seems that regardless of what I try to pass for argument "f", the result is always 0. It is as if any float I try to pass arrives in the kernel as 0.

Is there any sample code on how to pass a floating point value to the kernel?
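
Not sample code from the project, but a hedged guess at the cause: an untyped Rust float literal defaults to f64, while the kernel parameter is a 32-bit float, so the four bytes the driver reads from the 8-byte argument are the low half of 20.0f64, which is all zero bits. Explicitly typing the literal to match the kernel signature should fix it:

let result = unsafe {
    launch!(module.add_10<<<1, 1024, 0, stream>>>(
        // Type every scalar to match the kernel: int -> i32, float -> f32.
        4i32, c.device_ptr(), 1i32, a.device_ptr(), 1i32, 20.0f32
    ))
};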

documentation: sync copy is using the default stream

I'm new to CUDA and I ran into a race condition which could perhaps be prevented with changes to the documentation.

The problem

Mixing the default stream and a custom stream isn't a good idea.

The implementations of the CopyDestination trait are implicitly using the default stream. When you launch a kernel on a stream that was created with the NON_BLOCKING flag, this can lead to a race condition.

My confusion

The documentation of the NON_BLOCKING stream flag has a good explanation about the default (NULL) stream. Though the sentence:

Since RustaCUDA does not provide access to the NULL stream, this flag has no effect in
most circumstances. However, it is recommended to use it anyway, as some other crate
in this binary may be using the NULL stream directly.

made me believe that as long as I use RustaCUDA, I should enable NON_BLOCKING and everything will be fine, since the default stream is supposedly not used within the library -- which, as mentioned above, is not true.

For me there were two solutions:

  1. Either don't set the NON_BLOCKING stream flag; this way, even if I launch the kernel on a custom stream (there is currently no way in RustaCUDA to launch on the default stream), things are properly synchronized.
  2. Or use the async copy methods on the same stream the kernel is launched on, and synchronize the stream right after the copy operation (that's what I did).

Proposed fix

I propose adding a warning to the NON_BLOCKING stream flag documentation stating that the synchronous copy versions use the default stream, so this setting can have an impact. Additionally, I'd add information about the default stream to the CopyDestination trait itself.

RustaCUDA doesn't build on ARMv8

Background

Building RustaCUDA version 0.1.3 yields the following error when building on ARMv8:

error[E0308]: mismatched types
   --> /home/root/.cargo/registry/src/github.com-1ecc6299db9ec823/rustacuda-0.1.3/src/device.rs:376:48
    |
376 |             let mut cu_uuid = CUuuid { bytes: [0i8; 16] };
    |                                                ^^^ expected `u8`, found `i8`
    |
help: change the type of the numeric literal from `i8` to `u8`
    |
376 |             let mut cu_uuid = CUuuid { bytes: [0u8; 16] };
    |                                                 ~~

In both the cuda_runtime.rs and the cuda.rs bindings from cuda-sys, the struct's bytes field is a raw c_char array:
cuda_runtime.rs:

#[repr(C)]
#[derive(Debug, Default, Copy, Clone, Hash, PartialOrd, Ord, PartialEq, Eq)]
pub struct CUuuid_st {
    pub bytes: [::std::os::raw::c_char; 16usize],
}

cuda.rs

#[repr(C)]
#[derive(Debug, Default, Copy, Clone, Hash, PartialOrd, Ord, PartialEq, Eq)]
pub struct CUuuid_st {
    pub bytes: [::std::os::raw::c_char; 16usize],
}

Note that this doesn't happen when building on x86: c_char is i8 on x86 but u8 on ARM, so the 0i8 literal only type-checks on x86.

Proposed fix:

Change the literal to u8 as the compiler suggests (or use an untyped 0, which infers to c_char on either platform).

Introducing `Module::load_from_bytes()`

It's currently possible to directly embed a CUDA module via Module::load_from_string(). It takes a CStr, which is meant for nul-terminated byte strings, not for binary data.

Hence I propose introducing Module::load_from_bytes(), which does the same thing; the only difference is that the input will be a byte slice &[u8]. You can then use the include_bytes! macro to include the module.

I would even remove Module::load_from_string() and only have Module::load_from_bytes(), but I'd leave that decision to the maintainers.

I'm happy to do a PR, if that's a feature that would be accepted.
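
A hypothetical sketch of the proposed API in use (the method does not exist yet, and the path is illustrative):

// Embed a binary module without the UTF-8 and interior-nul constraints
// that include_str!/CString impose.
let module = Module::load_from_bytes(include_bytes!("../add.ptx"))?;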

Add support for JIT linking

CUDA provides the ability to link together different modules at runtime. It's pretty niche, but it is there so we should expose it through RustaCUDA.

See the Module module

250ms rustacuda: setup/tear down for unit tests

I have a simple benchmark for setting up / tearing down rustacuda environments:

use super::*;
use rustacuda::prelude::Device;
use rustacuda::prelude::Context;
use rustacuda::prelude::ContextFlags;
use std::ffi::CString;

use stopwatch::Stopwatch;

#[test]
fn test_00 () {

    let mut sw = Stopwatch::start_new();
    for _ in 0..100 {
        {
            rustacuda::init(rustacuda::CudaFlags::empty()).unwrap();
            let device = Device::get_device(2).unwrap();
            let ctx = Context::create_and_push(ContextFlags::MAP_HOST | ContextFlags::SCHED_AUTO, device).unwrap();

            let ptx = CString::new(include_str!("../out/main.ptx")).unwrap();
            let module = rustacuda::module::Module::load_from_string(&ptx).unwrap();
            let stream = rustacuda::stream::Stream::new(rustacuda::stream::StreamFlags::NON_BLOCKING, None).unwrap();
        }

        println!("Took: {}ms", sw.elapsed_ms());
        sw.restart();
    }
}

I get lines of the form:

Took: 314ms
Took: 235ms
Took: 242ms
Took: 246ms
Took: 249ms
Took: 238ms
Took: 239ms
Took: 240ms
Took: 263ms
Took: 251ms
Took: 255ms
Took: 241ms
Took: 243ms
Took: 241ms
Took: 242ms
Took: 254ms
Took: 253ms
Took: 245ms
Took: 240ms
Took: 241ms
Took: 251ms
Took: 239ms
Took: 257ms
Took: 265ms
Took: 256ms
Took: 254ms
Took: 254ms
Took: 248ms
Took: 251ms
Took: 251ms
Took: 247ms
Took: 240ms
Took: 244ms
Took: 238ms
Took: 243ms
Took: 255ms

It's basically ~250ms.

Normally, this doesn't matter -- except that almost all of my unit tests are of the form:
create a CUDA env, run some stuff, throw everything away.

=====

When I was using rust-accel, I could run a library with ~20 unit tests almost instantaneously. Now this takes seconds of time (a noticeable delay).

Question: What work is rustacuda doing that rust-accel is NOT doing, and how can I avoid doing that work, so that the setup / tear-down time is minimal?

=====

The ptx file is not the problem. It contains a few very short functions.

Synchronisation issues for multi-dimensional launch calls

Hey,

I implemented a simple kernel (it just copies each pixel of an image) and I see issues in the lower part of the image:

[image: output with black stripes at the bottom of the image]

The black stripes at the bottom of the image are different on each call, but always grow larger from top to bottom. Therefore I assume that stream.synchronise()?; has an issue for multi-dimensional kernel launches like this:


launch!(module.conv2d<<<(20, 20, 1), (32, 24, 1), 0, stream>>>(
    ...
))?;

(note: the image size is 640x480 pixels)

How can the synchronisation issue be solved?
Should I restrict my kernels to 1-dimensional block and thread dimensions?

launch!(module.conv2d<<<640, 480, 0, stream>>>( ... ))?;

eliminates the issue.

Panic during CUDA types' Drop implementations inside thread-local storage

Hi, I'm writing a Python module that uses RustaCUDA and I am running into an issue where the library panics when Python is exiting.

I store some RustaCUDA state in a thread-local storage key, like this:

struct CUDAInstance {
    module: Module,
    _context: Context
}
impl CUDAInstance {
    fn init() -> CudaResult<CUDAInstance> {
        rustacuda::init(CudaFlags::empty())?;
        let device = Device::get_device(0)?;
        let context = Context::create_and_push(ContextFlags::MAP_HOST | ContextFlags::SCHED_AUTO, device)?;

        Ok(CUDAInstance {
            _context: context,
            module: {
                const PTX: &str = concat!(include_str!(concat!("../cuda/cuda.ptx")), "\0");
                Module::load_from_string(unsafe { std::ffi::CStr::from_ptr(PTX.as_ptr() as *const i8) })?
            }
        })
    }
}
thread_local! {
    static CUDA: Result<CUDAInstance, CudaError> = CUDAInstance::init();
}

Everything works as expected, except that when Python has finished executing the script, Windows runs the thread-local storage destructor functions and RustaCUDA panics.

I would guess that CUDA has already been informed that the program is exiting and deinitializes itself; the drop implementations then panic, as they don't check whether the types have already been deinitialized.

thread '<unnamed>' panicked at 'Failed to unload CUDA module: Deinitialized', C:\Users\William\.cargo\registry\src\github.com-1ecc6299db9ec823\rustacuda-0.1.3\src\module.rs:223:18
stack backtrace:
   0: std::panicking::begin_panic_handler
             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c\/library\std\src\panicking.rs:584
   1: core::panicking::panic_fmt
             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c\/library\core\src\panicking.rs:143
   2: core::result::unwrap_failed
             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c\/library\core\src\result.rs:1749
   3: enum$<core::result::Result<tuple$<>,enum$<rustacuda::error::CudaError> >, 1, 100101, Err>::expect
             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c\library\core\src\result.rs:1022
   4: rustacuda::module::impl$1::drop
             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c\library\core\src\ptr\mod.rs:188
   5: core::ptr::drop_in_place
             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c\library\core\src\ptr\mod.rs:188
   6: core::ptr::drop_in_place
             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c\library\core\src\ptr\mod.rs:188
   7: core::ptr::drop_in_place
             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c\library\core\src\ptr\mod.rs:188
   8: core::ptr::drop_in_place
             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c\library\core\src\ptr\mod.rs:188
   9: core::mem::drop
             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c\library\core\src\mem\mod.rs:909
  10: std::thread::local::fast::destroy_value<enum$<core::result::Result<smh_vision_gpu::cuda::CUDAInstance,enum$<rustacuda::error::CudaError> > > >
             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c\library\std\src\thread\local.rs:669
  11: std::sys::windows::thread_local_dtor::run_keyless_dtors
             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c\/library\std\src\sys\windows\thread_local_dtor.rs:24
  12: std::sys::windows::thread_local_key::on_tls_callback
             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c\/library\std\src\sys\windows\thread_local_key.rs:200
  13: RtlInitializeConditionVariable
  14: RtlActivateActivationContextUnsafeFast
  15: LdrLoadAlternateResourceModuleEx
  16: LdrShutdownProcess
  17: RtlExitUserProcess
  18: ExitProcess
  19: exit
  20: exit
  21: <unknown>
  22: BaseThreadInitThunk
  23: RtlUserThreadStart
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
