
Home Page: http://ilps.science.uva.nl

License: MIT License


The device_matrix library

device_matrix is a lightweight, transparent, object-oriented and templated C++ library that encapsulates CUDA memory objects (i.e., tensors) and defines common operations on them.

Requirements & installation

To build the library and manage dependencies, we use CMake (version 3.5 or higher). In addition, we rely on the following libraries:

  • CUDA (version 8 or higher preferred), and
  • glog (version 0.3.4 or higher).

The cnmem library is used for memory management, and the tests are implemented using the googletest and googlemock frameworks; CMake fetches and compiles these dependencies automatically as part of the build pipeline. Finally, you need a CUDA-capable GPU to perform any computations.

To install device_matrix, the following instructions should get you started.

git clone https://github.com/cvangysel/device_matrix
cd device_matrix
mkdir build
cd build
cmake ..
make
make test
make install

Please refer to the CMake documentation for advanced options.
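For instance, the build type and install location can be set through CMake's standard cache variables (these are generic CMake options, not specific to device_matrix; adjust the paths to your environment):

```shell
# Configure an optimized build that installs under $HOME/.local.
# CMAKE_BUILD_TYPE and CMAKE_INSTALL_PREFIX are standard CMake variables.
cmake -DCMAKE_BUILD_TYPE=Release \
      -DCMAKE_INSTALL_PREFIX=$HOME/.local \
      ..
```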

Examples

The following examples can be found in the examples sub-directory of this repository; they are also compiled as part of the build process.

Matrix multiplication

#include <device_matrix/device_matrix.h>

#include <glog/logging.h>
#include <memory>

using namespace cuda;

int main(int argc, char* argv[]) {
    google::InitGoogleLogging(argv[0]);

    const cudaStream_t stream = 0; // default CUDA stream.

    // 2-by-3 matrix.
    std::unique_ptr<device_matrix<float32>> a(
        device_matrix<float32>::create(
            stream,
            {1.0, 2.0, 3.0, 4.0, 5.0, 6.0},
            2 /* num_rows */, 3 /* num_columns */));

    // 3-by-2 matrix.
    std::unique_ptr<device_matrix<float32>> b(
        device_matrix<float32>::create(
            stream,
            {7.0, 8.0, 9.0, 10.0, 11.0, 12.0},
            3 /* num_rows */, 2 /* num_columns */));

    // Destination for the 2-by-2 product.
    device_matrix<float32> c(
        2 /* num_rows */, 2 /* num_columns */, stream);

    // c = a * b.
    matrix_mult(stream,
                *a, CUBLAS_OP_N,
                *b, CUBLAS_OP_N,
                &c);

    cudaDeviceSynchronize();

    print_matrix(c);
}

Custom CUDA kernels

#include <device_matrix/device_matrix.h>

#include <glog/logging.h>
#include <memory>

using namespace cuda;

template <typename FloatT>
__global__
void inverse_kernel(FloatT* const input) {
    // One thread per element; negate in place.
    const size_t offset = threadIdx.y * blockDim.x + threadIdx.x;
    input[offset] = -input[offset];
}

int main(int argc, char* argv[]) {
    google::InitGoogleLogging(argv[0]);

    const cudaStream_t stream = 0; // default CUDA stream.

    std::unique_ptr<device_matrix<float32>> a(
        device_matrix<float32>::create(
            stream,
            {1.0, 2.0, 3.0, 4.0, 5.0, 6.0},
            2 /* num_rows */, 3 /* num_columns */));

    LAUNCH_KERNEL(
        inverse_kernel
            <<<1, /* a single block */
               dim3(a->getRows(), a->getCols()), /* one thread per component */
               0,
               stream>>>(
            a->getData()));

    cudaDeviceSynchronize();

    print_matrix(*a);
}

Design principles

device_matrix was explicitly designed to be inflexible with regard to variable passing and assignment, as the lifetime of a device_matrix instance corresponds directly to the lifetime of the CUDA memory region it has allocated. CUDA memory remains allocated as long as its underlying device_matrix exists, and device_matrix instances can only be passed as pointers or references. This gives the programmer total control over CUDA memory allocation, as it avoids garbage collection (e.g., Torch) and reference counting (e.g., shared_ptr), and allows for optimized CUDA memory usage. The library uses cnmem for memory management in order to avoid the performance issues caused by the recurrent re-allocation of memory blocks of a particular size.

To avoid the implicit allocation of on-device memory, any operation that results in a new allocation must be explicit about it. Most operations that return a new result therefore reuse one of their inputs as destination memory; in the process, the original input values are overwritten. Consequently, C++ operators that imply value modification were deliberately omitted.

The underlying CUDA memory space can easily be accessed by the library user. This allows the user to write arbitrary CUDA kernels that perform non-standard operations on CUDA objects in-place.

License

device_matrix is licensed under the MIT license. CUDA is a trademark of NVIDIA and is licensed separately.

If you modify device_matrix in any way, please link back to this repository.

