
eyalroz / cuda-kat

CUDA kernel author's tools

License: BSD 3-Clause "New" or "Revised" License

Cuda 82.61% C++ 16.71% CMake 0.68%
cuda cuda-kernels utility-library utility-functions cpp11 constexpr algorithms patterns modern-cpp gpu-programming


cuda-kat's Issues

Use functions instead of macros in on_device/printing.cuh

on_device/printing.cuh currently uses macros, mostly in order to easily "plug" the format string into a longer string.

This can actually be achieved, with a little voodoo, in a plain function, and without any heap allocation. Let's try to do that.
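
The "voodoo" could look something like the following compile-time string concatenation - a minimal sketch only, assuming C++14's relaxed constexpr is available (for device-side use, the function would additionally need __device__ marking or relaxed-constexpr compilation; all names here are made up for illustration):

#include <cstddef>

// Holds the concatenation of two string literals, computed at compile time -
// no heap allocation involved.
template <std::size_t N1, std::size_t N2>
struct concatenated { char data[N1 + N2 - 1]; };

template <std::size_t N1, std::size_t N2>
constexpr concatenated<N1, N2> concat(const char (&a)[N1], const char (&b)[N2])
{
	concatenated<N1, N2> result {};
	for (std::size_t i = 0; i < N1 - 1; i++) { result.data[i] = a[i]; }          // skip a's NUL
	for (std::size_t i = 0; i < N2;     i++) { result.data[N1 - 1 + i] = b[i]; } // copy b's NUL too
	return result;
}

// usage - no macros, no heap:
// constexpr auto format = concat("Block %d, Thread %d: ", "x = %f\n");
// printf(format.data, blockIdx.x, threadIdx.x, x);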

Doxygen-comment the code

Much of the code is lacking Doxygen comments - especially file-level comments. Write those in.

Introduce a lane_id_t/lane_t type

We have many functions returning lane ids or numbers-of-lanes. Most of these use unsigned; but for better readability/clarity, I'm thinking of introducing something like:

using lane_id_t = unsigned;

within the kat namespace.

Users - if I have any that look at this issue, which is unlikely - are welcome to comment on this prospect.

Check code coverage

Now that we have (half-)decent unit test coverage (see #24), we should introduce code coverage checks to see how much remains uncovered.

This requires:

  • Getting a coverage-related CMake module
  • Changing CMakeLists.txt (both of them? only for tests?) to use that module and to generate reports after a successful build
  • Working with a GitHub-interfacing service to visualize results or even check on the fly after each commit. Perhaps codecov?


Rearrange math builtins

"Built-ins" in cuda-kat means those functions which translate into single PTX instructions (not necessarily single SASS instructions though!)

We have on_device/builtins.cuh; and on_device/non-builtins.cuh, which contains functions which are builtin-like - or which one might expect to be built-in - but aren't, and instead have a pretty tight implementation, one or two lines, which calls a builtin.

There's a problem, though, with certain functions which only translate into single PTX instructions when --use_fast_math is specified as a compiler switch. Example: cosf() in the CUDA math function header. With --use_fast_math, it yields something like:

        cos.approx.ftz.f32      %f2, %f1;

but if you remove the switch, you get a sequence of over 150 (!!) PTX instructions, including several loops, for computing the cosine. See this on GodBolt.

So, cosf() and other functions are sometimes builtins and sometimes they aren't. Where should we put them, then?

Idea:

  • Have three files - builtins.cuh, non-builtins.cuh and maybe-builtins.cuh.
  • maybe-builtins.cuh will have everything depending on a switch.
  • The former two may have a bunch of "using ..." statements, conditioned on the compilation happening with the appropriate switch. Thus, if all three are included, builtins::cosine() will work if --use_fast_math is on, and fail otherwise; and the opposite goes for non_builtins::cosine().

PS:

--use_fast_math is not even the only relevant switch: We have --prec-div (for precise division), --prec-sqrt (for precise square root), and --fmad (for enabling floating-point fused-multiply-add instructions). I should also note that not all functions which depend on these four switches are actually covered by cuda-kat right now - although that's a different issue.
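
As a rough illustration of the conditional-using idea - a sketch only, assuming nvcc defines the __USE_FAST_MATH__ macro under --use_fast_math (which is what the CUDA math headers appear to rely on):

namespace kat {
namespace builtins {
#ifdef __USE_FAST_MATH__
// compiles to a single cos.approx.ftz.f32 PTX instruction
__device__ inline float cosine(float x) { return __cosf(x); }
#endif
} // namespace builtins
namespace non_builtins {
#ifndef __USE_FAST_MATH__
// compiles to a long, multi-instruction PTX sequence
__device__ inline float cosine(float x) { return cosf(x); }
#endif
} // namespace non_builtins
} // namespace kat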

Make kat::tuple compatible with std::tuple

We've adapted a tuple implementation; however, that tuple doesn't know that there's "another tuple" it needs to be compatible with... we do know. So, let's try and make kat::tuple usable wherever on the host side we can use std::tuple; and vice-versa.
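
As a first approximation of what "compatibility" might mean, consider a host-side conversion helper along these lines (hypothetical code - it assumes kat::get<> exists, and uses C++14's std::index_sequence):

#include <tuple>
#include <utility>

template <typename... Ts, std::size_t... Is>
std::tuple<Ts...> to_std_tuple_impl(const kat::tuple<Ts...>& t, std::index_sequence<Is...>)
{
	return std::tuple<Ts...>(kat::get<Is>(t)...);
}

template <typename... Ts>
std::tuple<Ts...> to_std_tuple(const kat::tuple<Ts...>& t)
{
	return to_std_tuple_impl(t, std::index_sequence_for<Ts...>{});
}

Fuller compatibility would rather mean converting constructors and assignment operators on kat::tuple itself, plus std::get / std::tuple_size support where the standard permits specialization.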

Specialize functions with many reads/writes for sub-4-byte element types

We have many templated functions which make a (potentially) large number of reads or writes to memory, and therefore benefit from coalescing their memory operations. However, most, if not all, of them are not specialized for element types less than 4 bytes long, and are therefore slower than they might be. Examples include copying, filling, appending to global memory, etc.

We should add specializations for these cases.
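
For example, a fill of single-byte elements could write whole 32-bit words; here is a block-scope sketch of the idea (assuming, for simplicity, a 4-byte-aligned start address - the real specialization would handle misalignment):

#include <cstddef>
#include <cstdint>

__device__ void fill_bytes(unsigned char* start, std::size_t length, unsigned char value)
{
	std::uint32_t word = value * 0x01010101u; // four copies of the byte
	auto words = reinterpret_cast<std::uint32_t*>(start);
	std::size_t num_words = length / 4;
	for (std::size_t i = threadIdx.x; i < num_words; i += blockDim.x) {
		words[i] = word; // coalesced 4-byte writes
	}
	// handle the up-to-3 trailing bytes
	for (std::size_t i = num_words * 4 + threadIdx.x; i < length; i += blockDim.x) {
		start[i] = value;
	}
}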

Consider using sized integer types for the builtins

While NVIDIA's own C headers for builtin wrappers use the fundamental types int, unsigned, unsigned long long etc. - the builtins are actually based on exact parameter sizes, not the wishy-washy C integer types. Should we then not make our builtins reflect that, by taking only intNN_t's and uintNN_t's as integer parameters?

Should we adhere to CUDA's confusing use of "index" and "id"?

An index is either a "list of items" arranged in order, or "a number... used as an indicator or measure", or "a number ... associated with another to indicate... position in an arrangement".

An id is an identification, or "a document bearing identifying information".

So, if we have an item in a 3D block, the triplet of its coordinates on each axis is - literally - that item's 'id', not its 'index'. If anything, it is its position in a linearization of that block that could be considered its 'index'.

... Unfortunately, CUDA defines this exactly backwards: The 'index' in CUDA is the 3-dimensional entity, and the 'id' is the number.

Currently, in my code, I'm making a bit of a mess of these two concepts. I'm going to sort that out - but the question is, how?

  1. Keep CUDA's terminology, use _index() functions in grid_info to return uint3's and dimensions_t's.
  2. Switch CUDA's terminology as described above, advise users to do the same.
  3. Use only neutral terms, e.g. 'position' and 'linearized_id'.
  4. Mix of the above.

What do you think?

Add a device-side-enabled version of gsl::span or std::span

std::span / gsl::span are very useful in host-side code: A pointer+length pair with standard-library-container trappings (iterators, operators, usable in standard algorithms etc.)

Now, it's not as though you should just use spans willy-nilly - they can have some overhead - but they can sometimes make sense. They may be particularly useful as kernel parameters: pay the overhead just once per thread, then do what you like within the kernel. Spans also make it easier to work with memory regions - device-side and host-side.

See also issue #17.

We can base ourselves on either a GSL implementation or a standard-library one.
But:

  1. We need it to work on the device side
  2. We need to detach it from the rest of the library (unless we want to import the entire GSL, which we probably don't):
    2.1 Drop Expects, Requires -> no (?) need for <gsl/gsl_assert>
    2.2 Roll our own byte, or use unsigned char -> no (?) need for <gsl/gsl_byte>
    2.3 Narrowing cast and other stuff in <gsl/gsl_util> - ???
  3. We can assume at least C++11 (perhaps 14?)

Edit: Going with std::span, adapted for earlier C++ versions. Hope it's not too bad. Still not sure I shouldn't just use gsl::span instead.
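
To give an idea of the shape of the thing, an absolutely minimal device-capable span might look like this (the real adaptation carries much more of std::span's interface):

#include <cstddef>

template <typename T>
class span {
public:
	__host__ __device__ constexpr span(T* data, std::size_t size) : data_(data), size_(size) { }
	__host__ __device__ constexpr T*          data()  const { return data_; }
	__host__ __device__ constexpr std::size_t size()  const { return size_; }
	__host__ __device__ constexpr T*          begin() const { return data_; }
	__host__ __device__ constexpr T*          end()   const { return data_ + size_; }
	__host__ __device__ constexpr T& operator[](std::size_t i) const { return data_[i]; }
private:
	T* data_;
	std::size_t size_;
};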

Pass arguments by value to atomic functions

CUDA atomics apply to trivially-copyable types of size at most 8 bytes. There is, therefore, no benefit to their wrappers taking constant references rather than actual values; in fact, doing so mostly serves to cause problems when we try to pass rvalues.

So, let's just drop all use of const T& in favor of plain T.

Also, this means most atomic wrappers will only take a single address (pointer/reference), so we no longer need to mark them with __restrict__ (!).
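
The change amounts to this (an illustrative sketch; kat's actual wrapper names may differ):

// old: template <typename T> __device__ T add(T* __restrict__ address, const T& __restrict__ value);
// new:
template <typename T>
__device__ T add(T* address, T value)
{
	return atomicAdd(address, value); // T is small and trivially copyable - passing by value is free
}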

Untangle the mess in primitives/

The code under src/cuda/on_device/primitives is a hot mess.

I mean, most of it is very useful, but not all of it; and there's almost no order to the different files, except w.r.t. the scope of collaboration (warp/block/grid).

At the very least we need to:

  1. Remove code whose general usefulness is limited/questionable.

  2. Extract related functionality into a separate file (or files, for the different scopes):

    1. Shared memory
    2. Thread/lane coordination
    3. Iteration/coverage patterns (like at_warp_stride())
    4. Reductions and reduction-like operations
    5. (Search?)

    ... and do it while keeping the namespace scheme (e.g. separating block-scope from grid-scope functions).

  3. Consider duplicate functionality (there's probably a bit of that in there)

Support system-wide and block-wide atomics

Beginning with CUDA 10 (or maybe 9?) we have three kinds of atomics:

  • atomicFoo() - atomic w.r.t. other memory access from within the same GPU.
  • atomicFoo_system() - atomic w.r.t. memory access from any GPU and from the host, on the same system.
  • atomicFoo_block() - atomic w.r.t. memory accesses from threads in the same thread block only.

We currently support only the first kind, but should support the other two.
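
One way to expose all three would be a scope template parameter - a sketch with hypothetical names (note the _system/_block variants require compute capability 6.0 or higher):

enum class memory_scope { device_wide, system_wide, block_wide };

template <memory_scope Scope = memory_scope::device_wide>
__device__ unsigned add(unsigned* address, unsigned value)
{
	switch (Scope) {
	case memory_scope::system_wide: return atomicAdd_system(address, value);
	case memory_scope::block_wide:  return atomicAdd_block(address, value);
	default:                        return atomicAdd(address, value);
	}
}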

Use "xxx_index" in dimensioned contexts and "xxx_id" in linearized ones

In a 2D or 3D block, the CUDA "thread index" - according to the official documentation - is a 2D or 3D entity, while the "thread ID" is its linearization (where x changes fastest, then y, then z).

cuda-kat currently doesn't observe this distinction, due to a bias in favor of work with linear grids. We should make sure it is respected in:

  • Function and member names (especially but not limited to grid_info:: namespaces)
  • Parameter names
  • Implementations

We should also check if there are separate special registers for the dimensioned index and the id; that could be useful.

Bring some order to `_safe` vs `_unsafe`, constexpr vs non-constexpr math functions

At the moment, some of our math utility functions have both constexpr and non-constexpr variants, in different files, while others have only the constexpr ones (which work at run-time as well). We indicate the first case with the kat::constexpr_ namespace - but is this explained anywhere? Not really. Also unexplained: why math.cuh includes constexpr_math.cuh.

To add to the fun, we have several functions with two implementations, foo_safe() and foo_unsafe() - and some of the unsafe versions are actually in constexpr_math...

We should get this stuff in order.

Distinguish between PTX builtins and SASS builtins

At the moment, our effective definition of a "builtin" function is one that produces a single PTX instruction (when inlined); and this definition is not even entirely consistent in our code.

However, PTX instructions are in no way guaranteed to become single SASS instructions. An example, which motivated our inconsistency: CLZ vs CTZ. There is a CLZ instruction in PTX; but no NVIDIA micro-architecture has it as a single instruction - it just gets implemented internally by some SASS sequence.

It should be clear to the user of cuda-kat what will result in a single hardware instruction and what may or may not be one.

Cover all functionality with basic unit tests

Most code in the library is currently not covered by any unit tests. Let's add that coverage.

  • src/kat/containers/span.hpp
  • src/kat/containers/array.hpp
  • src/kat/containers/tuple.hpp
  • src/kat/on_device/time.cuh
  • src/kat/on_device/c_standard_library/string.cuh
  • src/kat/on_device/miscellany.cuh
  • src/kat/on_device/wrappers/atomics.cuh
  • src/kat/on_device/wrappers/builtins.cuh
  • src/kat/on_device/wrappers/shuffle.cuh
  • src/kat/on_device/shared_memory/operations.cuh
  • src/kat/on_device/shared_memory/basic.cuh
  • src/kat/on_device/non-builtins.cuh
  • src/kat/on_device/constexpr_math.cuh
  • src/kat/on_device/math.cuh
  • src/kat/on_device/collaboration/warp.cuh
  • src/kat/on_device/collaboration/grid.cuh
  • src/kat/on_device/collaboration/block.cuh
  • src/kat/on_device/sequence_ops/warp.cuh
  • src/kat/on_device/sequence_ops/grid.cuh
  • src/kat/on_device/sequence_ops/block.cuh
  • src/kat/on_device/streams/*
  • src/kat/on_device/ranges.cuh

Skipping:

  • src/kat/on_device/unaligned.cuh <- Dropped from the repository (but expected to make a comeback soon - with unit tests)
  • src/kat/on_device/grid_info.cuh <- These are all one-liners; it's like testing an int add(int x, int y) { return x+y; } function - what can you test about it?
  • src/kat/on_device/printing.cuh <- Dropped from the repository (for now)
  • src/kat/on_device/ptx/* <- Mostly covered by "builtins" unit tests, and not quite "user-facing".

Cull some of the printing.cuh code

printing.cuh currently has some redundant code, some unused code, code that's really not GPU-specific in any way, and code which may not be significant enough to publish as part of the library.

I should make a pass over the file and remove most/all of this code.

Missing include in containers/array.hpp

For std::reverse_iterator we need #include <iterator>.

Thus, something like:

#include <kat/containers/array.hpp>

int main() {
	kat::array<int, 7> arr;
}

currently fails to compile - even though it obviously should.

Add <algorithm> and <numeric> functions as thread-level primitives?

While it's rarely a great idea, for the sake of completeness we may want to have implementations of the abstract <algorithm> and <numeric> algorithms which could be run by each thread individually, without collaboration, on its own data.

What do you think? Good idea? Bad idea?

See also issue #18.

Semantics of atomic::increment() and atomic::decrement() wrong

I've somehow mis-implemented atomic::increment() and atomic::decrement(). Need to bring the comments and the implementation in line with the CUDA Programming Guide description, as well as make sure that the default arguments actually do something useful (which, currently, they do not).
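
For reference, these are the Programming Guide semantics to match; and a default wraparound value of ~0u would make the wrappers behave as a plain (wrapping) increment/decrement. A sketch - the kat-side names and defaults here are my guesses:

// atomicInc(p, limit): old = *p; *p = (old >= limit) ? 0 : (old + 1); returns old
// atomicDec(p, limit): old = *p; *p = (old == 0 || old > limit) ? limit : (old - 1); returns old

__device__ inline unsigned increment(unsigned* address, unsigned wraparound_value = ~0u)
{
	return atomicInc(address, wraparound_value);
}

__device__ inline unsigned decrement(unsigned* address, unsigned wraparound_value = ~0u)
{
	return atomicDec(address, wraparound_value);
}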

Implement kat::linear_grid::collaborative::block::at_warp_stride()

We have grid-scope action in two forms: at grid stride, and at block stride - where each block acts on consecutive data. At block scope, we currently only have the first kind of action, in which the stride involves all collaborating elements; we're missing the second kind, in which each warp acts on consecutive data - i.e. each warp may be told how many consecutive warps'-worth of positions to act on.
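
A rough sketch of the missing function (the name is from the issue title; the exact semantics are my guess):

template <typename Size, typename Function>
__device__ void at_warp_stride(Size length, Function f)
{
	auto lane_id   = threadIdx.x % warpSize;
	auto warp_id   = threadIdx.x / warpSize;
	auto num_warps = blockDim.x / warpSize; // assumes blockDim.x is a multiple of warpSize
	// each warp takes a consecutive chunk of the overall range...
	Size chunk_length = (length + num_warps - 1) / num_warps;
	Size chunk_start  = warp_id * chunk_length;
	Size chunk_end    = (chunk_start + chunk_length < length) ? (chunk_start + chunk_length) : length;
	// ... and covers it at warp stride
	for (Size pos = chunk_start + lane_id; pos < chunk_end; pos += warpSize) {
		f(pos);
	}
}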

Add wrappers (and builtins?) for more PTX instructions

The following PTX instructions don't have wrapper functions (nor builtins:: templated functions where relevant). Add them!

  • lop3 - logical operation on 3 operands, using an immediate lookup table (an 8-bit truth table over the three inputs).
  • prefetching instructions?
  • cvt.pack
  • fns - find n'th bit set
  • Sub-32-bit dot products with accumulation: dp4a and dp2a, for bytes and halfwords respectively.

Should we use CUDA's implicit host-device move and forward?

The programming guide says:

E.3.14.3. Rvalue references

By default, the CUDA compiler will implicitly consider std::move and std::forward function templates to have __host__ __device__ execution space qualifiers, and therefore they can be invoked directly from device code. The nvcc flag --no-host-device-move-forward will disable this behavior; std::move and std::forward will then be considered as __host__ functions and will not be directly invokable from device code.

We currently use our own kat:: versions of those two functions. Should we drop them in favor of std::move() and std::forward(), relying on this behavior of CUDA's?

Add shuffle tests with non-power-of-2 sizes

At the moment, we only test the shuffle::xxx functions with types of sizes 1, 2, 4 and 8. We should test them with types of other sizes, in particular: 3, 5, 7, and sizes beyond 8.

Support atomicCAS() for all types

We need to have an atomicCAS() equivalent available, one way or another, for all types - up to the hardware's capability for atomic ops (8 bytes, i.e. unsigned long long int). Right now, only apply_atomically() is exposed - which is nice, but not enough.

So let's:

  1. Properly expose atomic::compare_and_swap() the way it is now (i.e. only for the types CUDA supports directly).
  2. Implement atomic::compare_and_swap() for smaller types using the larger type.
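
For point 2, the standard trick is to CAS the 32-bit word containing the small value; a sketch for single bytes (the alignment arithmetic is the fiddly part):

#include <cstdint>

__device__ unsigned char compare_and_swap(unsigned char* address, unsigned char compare, unsigned char val)
{
	auto address_value = reinterpret_cast<std::uintptr_t>(address);
	auto word_address  = reinterpret_cast<unsigned*>(address_value & ~std::uintptr_t{3});
	unsigned shift     = (address_value & 3) * 8; // our byte's position within its word
	unsigned mask      = 0xFFu << shift;
	unsigned old_word  = *word_address;
	unsigned assumed, new_word;
	do {
		assumed = old_word;
		if (((assumed & mask) >> shift) != compare) { break; } // comparand mismatch - CAS fails
		new_word = (assumed & ~mask) | (unsigned{val} << shift);
		old_word = atomicCAS(word_address, assumed, new_word);
	} while (old_word != assumed); // retry if other bytes of the word changed under us
	return (old_word & mask) >> shift; // the previous value, as atomicCAS() returns it
}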

Should we drop support for CUDA 8.x?

CUDA 9.0 was released in September of 2017 - 2.5 years ago. It changed the interfaces of some functions and related PTX instructions: mostly, .sync versions of these are now to be used, which take lane-mask parameters - e.g. warp balloting only among the lanes with 1-bits in the mask.

Should we continue to support the CUDA 8.x and earlier versions of these functions/PTX intrinsics, or is it safe to just drop them?

Block-level conjunction and disjunction

We should implement all_satisfy(), none_satisfy() and some_satisfy() at the block level (using the warp-level primitives and shared memory to exchange information).
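
A sketch of all_satisfy() along those lines, assuming the block's size is a multiple of warpSize (note that CUDA's own __syncthreads_and() intrinsic computes exactly this, so it could serve as a shortcut):

__device__ bool all_satisfy(bool thread_satisfies)
{
	__shared__ int warps_satisfy[32]; // one slot per possible warp in the block
	auto warp_id   = threadIdx.x / warpSize;
	auto lane_id   = threadIdx.x % warpSize;
	auto num_warps = blockDim.x / warpSize;
	int my_warp_satisfies = __all_sync(0xFFFFFFFFu, thread_satisfies); // warp-level primitive
	if (lane_id == 0) { warps_satisfy[warp_id] = my_warp_satisfies; }
	__syncthreads();
	if (threadIdx.x == 0) {
		int result = 1;
		for (unsigned w = 0; w < num_warps; w++) { result = result && warps_satisfy[w]; }
		warps_satisfy[0] = result; // broadcast slot
	}
	__syncthreads();
	return warps_satisfy[0];
}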

Use spans as parameters where relevant

With a reasonable compiler and a reasonable implementation of a gsl::span-like class (see issue #6), there should be no penalty for having device functions take a kat::span<T> instead of a T* and a size_t length. So let's convert some functions to using that...

Of course we might need to make the span templated on the appropriate size type.
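
For instance (illustrative only; kat::span is the class discussed in issue #6):

template <typename T>
__device__ void fill(kat::span<T> region, T value)
{
	for (std::size_t i = threadIdx.x; i < region.size(); i += blockDim.x) {
		region[i] = value;
	}
}

... instead of the T* start, std::size_t length parameter pair the current signatures take.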

Array.fill with constexpr variable

This is a general problem of nvcc I would say:

#include <array>
#include <kat/containers/array.hpp>

constexpr int duzzle = -7;

__global__
void kernel() {
	kat::array<int, 7> arr;
	arr.fill(duzzle); // fails to compile
}

int main() {

	std::array<int, 7> arr;
	arr.fill(duzzle);
}

The problem here is the signature of fill(const value_type&): if we omit the reference, or define the constexpr variable in the device function itself, it works fine. A really funny workaround:

__global__
void kernel() {
	kat::array<int, 7> arr;
	constexpr int duzzle_ = duzzle;
	arr.fill(duzzle_);
}

Inappropriate return type and incorrect calculation in grid_info.cuh

In grid_info.cuh:

  1. The variants of the index() function currently return dimensions_t rather than unsigned. This might fit a "position" function (perhaps the subject of another issue I'll open), but certainly not an index, singular.
  2. The templated index_in_grid() function for threads doesn't currently compute the index correctly; plus, it should use other, already-defined functions - which themselves call detail::row_major_linearization() - rather than calling that directly.
  3. warp::index() recursively calls itself.

Simplify some of the numeric functions in math.cuh and constexpr.cuh

There is overuse of const T& over plain T for the parameters of some math functions. These functions are intended for numbers, not more complex objects; plus, they are inlined, and simple enough for any copies to be optimized away - so let's not complicate our lives with references here.

Add missing <algorithm> functions to the warp and block sequence-ops

We need most, if not all, of the functions in
<algorithm> and <numeric> available on the device, for execution at the warp and block level. (But not the uninitialized memory stuff, nor qsort/bsearch etc.)

Implementation status:

Non-modifying sequence operations

  • all_of
  • any_of
  • none_of
  • for_each
  • for_each_n
  • count
  • count_if
  • mismatch
  • find
  • find_if
  • find_if_not
  • find_end
  • find_first_of
  • adjacent_find
  • search
  • search_n

Modifying sequence operations

  • copy
  • copy_if
  • copy_n
  • copy_backward
  • move
  • move_backward
  • fill
  • fill_n
  • transform
  • generate
  • generate_n
  • remove
  • remove_if
  • remove_copy
  • remove_copy_if
  • replace
  • replace_if
  • replace_copy
  • replace_copy_if
  • swap
  • swap_ranges
  • iter_swap
  • reverse
  • reverse_copy
  • rotate
  • rotate_copy
  • shift_left
  • shift_right
  • random_shuffle
  • shuffle
  • sample
  • unique
  • unique_copy

Partitioning operations

  • is_partitioned
  • partition
  • partition_copy
  • stable_partition
  • partition_point

Sorting operations

  • is_sorted
  • is_sorted_until
  • sort
  • partial_sort
  • partial_sort_copy
  • stable_sort
  • nth_element

Binary search operations (on sorted ranges)

  • lower_bound
  • upper_bound
  • binary_search
  • equal_range

Other operations on sorted ranges

  • merge
  • inplace_merge

Set operations (on sorted ranges)

  • includes

  • set_difference
  • set_intersection
  • set_symmetric_difference
  • set_union

Heap operations

  • is_heap
  • is_heap_until
  • make_heap
  • push_heap
  • pop_heap
  • sort_heap

Minimum/maximum operations

  • max
  • max_element
  • min
  • min_element
  • minmax
  • minmax_element
  • clamp

Comparison operations

  • equal
  • lexicographical_compare
  • compare_3way
  • lexicographical_compare_3way

Permutation operations

  • is_permutation
  • next_permutation
  • prev_permutation

Numeric operations

  • iota
  • accumulate
  • inner_product
  • adjacent_difference
  • partial_sum
  • reduce
  • exclusive_scan
  • inclusive_scan
  • transform_reduce
  • transform_exclusive_scan
  • transform_inclusive_scan

Shuffles are warp collaboration primitives

Shuffles are warp collaboration primitives. They should be in namespace kat::collaboration::warp - and declared in the warp collaboration primitives header - if only perhaps through an inclusion of another file.

Add testing instrumentation

While this code originates in other repositories, which do have some unit tests and other testing - those can't be migrated here. We should have tests for all the functionality in this repository, and a proper testing framework/library for them.

I'm thinking of doctest but my mind is not fully made up yet.

Device-side function for pretty-printing (part of a) column

This repo should have a device-side function (probably for a single thread to run) which pretty-prints a part of a column in GPU memory. It should be a well-pimped function with lots of knobs and levers for configuring the printing (elements per row, separators, inter-column spacing, width, chars vs numbers, numeric base, bit-resolution interpretation yes/no, true/false symbols for bits or booleans, indices at start of line, header lines yes/no etc.)

Implement the non-linear-grid variants of all grid_info functions

At the moment, a lot of the functions in grid_info.cuh are only relevant to linear grids. We should:

  • Ensure all linear-grid specific functions are under kat::linear_grid::grid_info.
  • Ensure all functions implemented for linear grids are implemented for non-linear grids (unless they're irrelevant).
  • Give some more thought to function naming, considering the linear-vs-non-linear grid dichotomy.

Templatize grid_info functions for less-than-3-dim kernels

While we have a separate namespace for grid_info functions in linear grids, I also want to templatize the general versions of the functions, to support working with such grids while minimizing unnecessary computations (i.e. those based on the .y and .z values of positions and dimensions). I also want to do something similar for 2D grids. So - I'm going to templatize.

Will implement this together with a "choice" w.r.t. issue #50.
