
moderngpu's Introduction

moderngpu

moderngpu is a productivity library for general-purpose computing on GPUs. It is a header-only C++ library written for CUDA. The unique value of the library is in its accelerated primitives for solving irregularly parallel problems.

Quick Start Guide

git clone https://github.com/moderngpu/moderngpu.git
cd moderngpu
mkdir build && cd build
cmake ..
make # or make name_of_project to build a specific binary
./bin/test_segreduce
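
Since moderngpu is header-only, using it in your own code is just an include plus the right nvcc flags. A minimal sketch in the spirit of the library's tutorials (the file name is illustrative):

// hello_mgpu.cu
#include <moderngpu/transform.hxx>
#include <cstdio>

int main() {
  mgpu::standard_context_t context;

  // Launch one device thread per index.
  mgpu::transform([] MGPU_DEVICE(int index) {
    printf("Hello from element %d\n", index);
  }, 4, context);

  context.synchronize();
  return 0;
}

// Compile from the repo root with:
//   nvcc -std=c++11 --expt-extended-lambda -I src hello_mgpu.cu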

How to Cite

@unpublished{Baxter:2016:M2,
  author = {Baxter, Sean},
  title  = {moderngpu 2.0},
  note   = {\url{https://github.com/moderngpu/moderngpu/wiki}},
  year   = {2016}
}

moderngpu's People

Contributors

bergdorf, crozhon, neoblizz, sdalton1, seanbaxter, yzhwang


moderngpu's Issues

can't build tut_04_merge

repro:

$ nvcc tut_04_merge.cu -I../src/ -std=c++11 -expt-extended-lambda
tut_04_merge.cu(39): error: no instance of function template "mgpu::thread_iterate" matches the argument list
            argument types are: (int, int, lambda [](int, int)->void)
          detected during instantiation of "void simple_merge<launch_t,a_it,b_it,c_it,comp_t>(a_it, int, b_it, int, c_it, comp_t, mgpu::context_t &) [with launch_t=mgpu::launch_params_t<128, 8, 8, 0>, a_it=int *, b_it=int *, c_it=int *, comp_t=mgpu::less_t<int>]"
(76): here

1 error detected in the compilation of "/var/folders/3q/bvdwl5dj72g53sldd774gdq80000gn/T//tmpxft_00016520_00000000-9_tut_04_merge.cpp1.ii".

It looks like a thread_iterate overload is missing in https://github.com/NVlabs/moderngpu/blob/master/src/moderngpu/meta.hxx#L200

Cannot compile moderngpu with clang as the CUDA compiler

When using clang-12 as the CUDA compiler with CUDA Toolkit 11.4.1, the compiler complains about uses of PRAGMA_UNROLL, which is defined as

#ifndef PRAGMA_UNROLL
#ifdef __CUDA_ARCH__
  #define PRAGMA_UNROLL #pragma PRAGMA_UNROLL
#else
  #define PRAGMA_UNROLL
#endif
#endif

in meta.hxx. This syntax appears to be supported only by nvcc.
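
One possible workaround (an untested sketch, not an official fix) is the standard C++11 _Pragma operator, which both nvcc and clang accept inside macro expansions:

#ifndef PRAGMA_UNROLL
#ifdef __CUDA_ARCH__
  #define PRAGMA_UNROLL _Pragma("unroll")
#else
  #define PRAGMA_UNROLL
#endif
#endif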

Usage of caching iterators

I am trying to optimize the runtime of many calls to mgpu::lbs_segreduce that all use the same segment configuration.
If I understand correctly, there is a way to use caching iterators for this.
Could you maybe give an example of how to use them?
There is no documentation on this topic yet.

Error while running test

I am trying to run the mergesort and segreduce tests, only to find the following error:

NVIDIA GeForce RTX 2080 : 1710.000 Mhz   (Ordinal 0)
46 SMs enabled. Compute Capability sm_75
FreeMem:   7862MB   TotalMem:   7981MB   64-bit pointers.
Mem Clock: 7000.000 Mhz x 256 bits   (448.0 GB/s)
ECC Disabled


Floating point exception (core dumped)

How could I fix the bug?

SegmentedSort fails with large numbers

I edited the segsort benchmark to include a line with 120M items and a mean segment length of 60K; it then fails:

Segmented sort keys on type int.
   10M -    100:  1219.470 M/s    9.756 GB/s
   10M -    300:  1174.506 M/s    9.396 GB/s
   10M -     1K:  1084.914 M/s    8.679 GB/s
   10M -     3K:   946.366 M/s    7.571 GB/s
   10M -    10K:   788.421 M/s    6.307 GB/s
   10M -    30K:   671.062 M/s    5.368 GB/s
   10M -   100K:   570.081 M/s    4.561 GB/s
   10M -   300K:   500.293 M/s    4.002 GB/s
   10M -     1M:   442.928 M/s    3.543 GB/s
   10M -     3M:   396.381 M/s    3.171 GB/s
   10M -    10M:   396.540 M/s    3.172 GB/s
   30M -    15K:   741.859 M/s    5.935 GB/s
  120M -    60K: 10235.990 M/s   81.888 GB/s
MISMATCH AT 0

RapidJson include directory not set.

CMake Error at CMakeLists.txt:8 (message):
RapidJson include directory not set.

CMake Error at CMakeLists.txt:14 (message):
Modern GPU include directory not set.

CMake Error at CMakeLists.txt:20 (message):
CUB library include directory not set.

CMake Error at CMakeLists.txt:48 (CUDA_ADD_LIBRARY):
Unknown CMake command "CUDA_ADD_LIBRARY".
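
The last error usually means the (long-deprecated) FindCUDA module was never loaded: CUDA_ADD_LIBRARY only exists after a line like the following appears in CMakeLists.txt before its first use (a sketch; the project in question may organize this differently):

find_package(CUDA REQUIRED)

The include-directory errors suggest that this CMakeLists.txt expects cache variables pointing at RapidJson, moderngpu, and CUB; the exact variable names have to be read out of that file and passed to cmake with -D.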

Merge Sort Infinite Loop

Looks like mergesort has a bug when the number of items is small:

  standard_context_t context;

  // Loop over counts from 1 to 2000.
  for(int count = 1; count <= 2000; count += count / 10) {
    for(int it = 1; it <= 5; ++it) {

      mem_t<int> data = fill_random(0, 100000, count, false, context);

      mergesort(data.data(), count, less_t<int>(), context);

      std::vector<int> ref = from_mem(data);
      std::sort(ref.begin(), ref.end());
      std::vector<int> sorted = from_mem(data);

      bool success = ref == sorted;
    
      printf("%7d: %d %s\n", count, it, success ? "SUCCESS" : "FAILURE");

      if(!success)
        return 1;
    }   
  }

I get an infinite loop when count is small, e.g., 1, 2, 3, 4, 5, .... When count is big, like over 1000, everything runs fine on my Nvidia GPU. Any thoughts? (Note that for count < 10 the increment count += count / 10 adds zero, so the outer host loop never advances by itself; that alone would present as a hang.)

Error in building

When I run make I get the following error:

demo/cities.cu(265): internal error: assertion failed: gen_paren_or_brace_dynamic_init: bad kind (cp_gen_be.c, line 21252 in gen_paren_or_brace_dynamic_init)

1 catastrophic error detected in the compilation of "demo/cities.cu".
Compilation aborted.
Aborted (core dumped)

I am not sure how to fix this. Any suggestions?

good launch_params_t for mergesort

I'm currently in the process of replacing thrust with moderngpu because of Oblomov/titanxstall#2

My problem now is choosing an adequate launch_params_t depending on the number of elements to be sorted, as that number can be changed by the user.

Thrust chooses this automatically somehow; I did not find out how, but it works quite well.

Any recommendations?
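
For what it's worth, mergesort takes its tuning as a template argument, so one can at least switch parameters by problem size. A rough sketch (the threshold and the <nt, vt> values below are made up and would need benchmarking on the target GPU):

#include <moderngpu/kernel_mergesort.hxx>

// Hypothetical size-based dispatch; 128/256 threads and grain sizes
// 7/11 are placeholders, not recommendations.
template<typename key_t, typename comp_t>
void mergesort_auto(key_t* keys, int count, comp_t comp,
    mgpu::context_t& context) {
  if(count < (1 << 20))
    mgpu::mergesort<mgpu::launch_params_t<128, 7>>(keys, count, comp, context);
  else
    mgpu::mergesort<mgpu::launch_params_t<256, 11>>(keys, count, comp, context);
}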

ModernGPUv1

We'd like to compile some code using ModernGPUv1 with CUDA 9, but CUDA 9 raises a warning about the use of __shfl_up():

moderngpu/include/device/../device/intrinsics.cuh(113): warning: function "__shfl_up(float, unsigned int, int)"
moderngpu/include/device/../device/intrinsics.cuh(123): warning: function "__shfl_up(int, unsigned int, int)"
moderngpu/include/device/../device/intrinsics.cuh(124): warning: function "__shfl_up(int, unsigned int, int)"
/usr/include/sm_30_intrinsics.hpp(175): here was declared deprecated ("__shfl_up() is deprecated in favor of __shfl_up_sync() and may be removed in a future release (Use -Wno-deprecated-declarations to suppress this warning).")

It suggests using __shfl_up_sync() instead.

Nvidia warns not to upgrade blindly:

Don’t just use FULL_MASK (i.e. 0xffffffff for 32 threads) as the mask value. If not all threads in the warp can reach the primitive according to the program logic, then using FULL_MASK may cause the program to hang.

Do you know if the 0xffffffff (FULL_MASK) mask is appropriate for ModernGPUv1? The ModernGPUv1 code raising the warnings is copied below for easy reference:

#pragma push_macro("__shfl_up")
#undef __shfl_up
__device__ __forceinline__ float shfl_up(float var, 
  unsigned int delta, int width = 32) {

#if __CUDA_ARCH__ >= 300
  var = __shfl_up(var, delta, width);
#endif  
  return var;
}

__device__ __forceinline__ double shfl_up(double var, 
  unsigned int delta, int width = 32) {

#if __CUDA_ARCH__ >= 300
  int2 p = mgpu::double_as_int2(var);
  p.x = __shfl_up(p.x, delta, width);
  p.y = __shfl_up(p.y, delta, width);
  var = mgpu::int2_as_double(p);
#endif
  
  return var;
}
#pragma pop_macro("__shfl_up")
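
For reference, a sync-aware variant of the float overload might look like the sketch below. The 0xffffffff mask is only safe if every lane of the warp reaches the call, which is exactly the property that would need to be verified for each ModernGPUv1 kernel before adopting it:

__device__ __forceinline__ float shfl_up(float var,
  unsigned int delta, int width = 32) {
#if __CUDA_ARCH__ >= 300
  // Assumes all 32 lanes are converged here; if a caller can reach
  // this with divergent lanes, a narrower mask must be computed.
  var = __shfl_up_sync(0xffffffff, var, delta, width);
#endif
  return var;
}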

Citation

Hi,
How can I cite this work?

Problem with compiling bf-knn

cd bf-knn
nvcc -arch=sm_21 -I ../moderngpu/include/ bf_knn_device.cu bf_knn_host.cu demo.cc -o bf-knn-demo
I am facing a problem when compiling this code on the command line; help would be appreciated.

[screenshot: error1]

ERROR:
bf_knn_device.cu<44> : fatal error C1083: can not open include file 'moderngpu,cuh' No such a file or directory
Kindly help me with how to include this file.

Invalid __shared__ read of size 8

When using moderngpu for set intersection, I encountered this error:

Invalid __shared__ read of size 8
=========     at 0x00001530 in void mgpu::KernelSetOp<mgpu::LaunchBoxVT<int=128, int=23, int=0, int=128, int=11, int=0, int=128, int=11, int=0>, mgpu::MgpuSetOp, bool=1, int=2, bool=0, __int64 const *, __int64 const *, long*, int const *, int const *, int*, mgpu::less<long>>(int=0, int=0, int, int=128, mgpu::LaunchBoxVT<int=128, int=23, int=0, int=128, int=11, int=0, int=128, int=11, int=0>, int, int*, int const *, int=11, mgpu::MgpuSetOp, bool=1)

Help welcome.

std::binary_function deprecated in C++11 and removed in C++17

This problem shows up when I try to compile moderngpu within my library while relying heavily on C++17. It may be due to how I am linking things. The system is MSVC, CUDA 11.5.1, Windows.

@seanbaxter if you can give me permission to modify the project and create a new release with these changes, maybe I can also bump the version number.

wrong tid in cta_launch with non-power-of-two NT

Using cta_launch with an NT that is not a power of two results in wrong tid values being passed to the lambda. Simple test case:

static const int NT = 96;
mgpu::cta_launch<NT>([=] MGPU_DEVICE (const int tid, const int block) {
  // Note: threadIdx.x & (NT - 1) equals threadIdx.x % NT only when NT
  // is a power of two; for NT = 96, e.g., 32 & 95 == 0.
  printf("thread %d %d %d\n", threadIdx.x, tid, threadIdx.x & (NT - 1));
}, 1, ctx);

For NT = 32, 64, and 128 this works fine. However, any NT that is a multiple of 32 should be valid, right?

Support for "tie"

Hi! Would it be possible to complete support for mgpu::tie?
As it stands, it's missing support for assigning an mgpu::tuple<args_t...> to an
mgpu::tuple<args_t&...>.

Example

#include <moderngpu/tuple.hxx>

MGPU_DEVICE
mgpu::tuple<int, float>
example_tuple() {
  return mgpu::make_tuple(1, 2.f);
}


MGPU_DEVICE
mgpu::tuple<int, float>
example_using_tie() {
  int a;
  float b;
  mgpu::tie(a,b) = example_tuple();
  return mgpu::make_tuple(a,b);
}

yields

main.cu(15): error: no operator "=" matches these operands
            operand types are: mgpu::tuple<int &, float &> = mgpu::tuple<int, float>

Thanks!
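
In the meantime, one workaround (a sketch, assuming mgpu::get<> mirrors std::get as elsewhere in tuple.hxx) is to unpack the result manually:

MGPU_DEVICE
mgpu::tuple<int, float>
example_without_tie() {
  auto t = example_tuple();
  int a = mgpu::get<0>(t);
  float b = mgpu::get<1>(t);
  return mgpu::make_tuple(a, b);
}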

Dependency on CUDA 5.0.props - Error opening the project in VS

Hi,

I have CUDA 5.5 installed and VS2012.

I tried to open the project but I am getting the following error:

C:\Alenka\moderngpu\demo\demo.vcxproj(32,5): The imported project "C:\Program Files (x86)\MSBuild\Microsoft.Cpp\v4.0\V110\BuildCustomizations\CUDA 5.0.props" was not found. Confirm that the path in the declaration is correct, and that the file exists on disk.

I am getting the same error for all the projects.

Thanks,
Sawan

Documentation mistake for SegSortPairsFromFlags

In the documentation for SegSortPairsFromFlags, the order of the parameters count and flags_global appears to be swapped. It should be:

MGPU_HOST void SegSortPairsFromFlags(KeyType* keys_global,
  ValType* values_global, int count, const uint* flags_global,
  CudaContext& context, Comp comp, bool verbose = false);

However, the documentation currently lists:

// Segmented sort using head flags and supporting value exchange.
MGPU_HOST void SegSortPairsFromFlags(KeyType* keys_global,
  ValType* values_global, const uint* flags_global, int count,
  CudaContext& context, Comp comp, bool verbose = false);

Thank you!

cudaErrorInvalidConfiguration (error 9) on segmented_sort() of trivial-sized data set

moderngpu git tag: v2.12_june_8_2016
CentOS 7, Linux version 3.10.0-327.22.2.el7.x86_64
gcc (GCC) 4.9.2 20150212 (Red Hat 4.9.2-6)
Cuda compilation tools, release 7.5, V7.5.17


Calling segmented_sort() on a trivial data set (e.g., 1 segment of 24 values) results in cudaErrorInvalidConfiguration (error 9). I determined the cause to be at kernel_segsort.hxx:69:

op_counters = fill<int2>(int2(), num_passes, context);

Per kernel_segsort.hxx:43-45, num_passes is set to 0 if num_ctas is equal to 1. Allocating an extra element in op_counters fixes the issue by allowing fill() to succeed:

op_counters = fill<int2>(int2(), num_passes + 1, context);

Minimum working example:

using uint = unsigned int;

#include "moderngpu/context.hxx"
#include "moderngpu/kernel_segsort.hxx"

#include <sstream>
#include <stdexcept>

int main(int, char**) {
    using namespace mgpu;
    using namespace std;

    standard_context_t context;

    uint nValues = 24, nSegments = 1;
    uint *dKeys, *dVals, *dSegs;

    cudaMalloc(&dKeys, sizeof(*dKeys) * nValues);
    cudaMalloc(&dVals, sizeof(*dVals) * nValues);
    cudaMalloc(&dSegs, sizeof(*dSegs) * (nSegments + 1));

    uint hKeys[] = {
        98, 63, 82, 50, 40, 36, 44, 36, 49, 17, 21, 71,
        82, 67, 54, 74, 60, 15, 85, 58, 11, 42, 22, 97
    };

    uint hVals[] = {
        34, 11, 98, 41, 83, 34, 49, 40, 61, 14, 10, 82,
        19, 27, 80, 66, 54, 99, 79, 90, 70, 42, 16, 78
    };

    uint hSegs[] = {0, nValues};

    cudaMemcpy(dKeys, &hKeys[0], sizeof(*dKeys) * nValues, cudaMemcpyHostToDevice);
    cudaMemcpy(dVals, &hVals[0], sizeof(*dVals) * nValues, cudaMemcpyHostToDevice);
    cudaMemcpy(dSegs, &hSegs[0], sizeof(*dSegs) * (nSegments + 1), cudaMemcpyHostToDevice);

    segmented_sort(
        dKeys, dVals, (int)nValues,
        dSegs, (int)nSegments,
        less_t<uint>(), context);

    cudaDeviceSynchronize();
    cudaError_t error = cudaGetLastError();

    if (error != cudaSuccess) {
        ostringstream msg;
        msg << "cudaGetLastError(): " << cudaGetErrorString(error) << endl;
        throw runtime_error(msg.str());
    }

    cudaFree(dKeys);
    cudaFree(dVals);
    cudaFree(dSegs);
    return 0;
}

possibly improve lambda_iterator_t

Is there a reason why make_{load_store,load,store}_iterator are not callable from device code and, conversely, why the assign_t operators are not callable from host code? I would find this functionality useful.

mgpu::mergesort: illegal memory access for >= 1500000000 keys

Context

We're benchmarking the performance of mgpu::mergesort (and other GPU sorting algorithms). In a nutshell, we generate the data on the host, copy it onto the device, initialize a stream, and measure the pure sort duration.

Example

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <src/moderngpu/kernel_mergesort.hxx>

#include <time.h>
#include <stdlib.h>
#include <iomanip>
#include <iostream>
#include <algorithm>
#include <chrono>

int main(int argc, char* argv[]) {
  const size_t num_elements = std::stoull(argv[1]);

  thrust::host_vector<int> host_elements(num_elements);
  std::generate(host_elements.begin(), host_elements.end(), rand);
  thrust::device_vector<int> elements = host_elements;

  cudaSetDevice(0);
  cudaDeviceSynchronize();
  cudaStream_t stream;
  cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

  auto t1 = std::chrono::high_resolution_clock::now();
  mgpu::standard_context_t context(false, stream);
  mgpu::mergesort(static_cast<int*>(thrust::raw_pointer_cast(elements.data())), num_elements, mgpu::less_t<int>(), context);
  cudaStreamSynchronize(stream);
  std::chrono::duration<double> t2 = std::chrono::high_resolution_clock::now() - t1;

  std::cout << num_elements << "," << std::fixed << std::setprecision(9) << t2.count() << "\n";

  if (!thrust::is_sorted(elements.begin(), elements.end())) {
    std::cout << "Error: Invalid sort order.\n";
  }

  return 0;
}

We compile the example with nvcc -O3 -std=c++11 --expt-extended-lambda -o mgpu_sort mgpu_sort.cu and run it with ./mgpu_sort <num_elements> on two different platforms.

  • IBM AC922: 4x NVIDIA Tesla V100 SXM2 32 GB, CUDA 11.2
  • NVIDIA DGX A100: 8x NVIDIA A100 SXM4 40 GB, CUDA 11.0

Error

On both platforms, for num_elements >= 1500000000, we get the following error:

terminate called after throwing an instance of 'mgpu::cuda_exception_t'
  what():  an illegal memory access was encountered

Can somebody help?

Compiling moderngpu in Nsight Eclipse

When I compile the package using the makefile it works, but I need to compile it in Nsight Eclipse. Although my gcc version is 4.8 and the flag for supporting C++11 is enabled in Nsight Eclipse, I get this error: "nvcc warning : The -c++11 flag is not supported with the configured host compiler. Flag will be ignored"
Could you please let me know how I can solve the problem?
Thanks

CUDA 8, VS2015

Hi,

I am having trouble getting reduce_by_key to compile; I think there are some compiler issues.
My system is VS2015 with CUDA 8.

The following "attribute((aligned))" in tuple.hxx

image

makes the compiler choke.

The second problem is the __restrict__ keyword usage in meta.hxx:

[screenshot: __restrict__ usage in meta.hxx]

It seems that the __restrict__ keyword is not allowed on type definitions. It throws this compiler error:

https://msdn.microsoft.com/en-us/library/097sy9kt.aspx

Removing problem 1 and all type definitions that use __restrict__ makes it work. However, I am not sure how this influences the functionality of the library.

Thanks for the great work!

Symmetric difference on device

1) Is there a way to compute the symmetric difference by calling a function from inside a kernel?
2) Is symmetric difference implemented in moderngpu 2.0? If so, is there any documentation?

Cannot build demos on Linux

make: *** No rule to make target `demo.o', needed by `demo'. Stop.

pez@pezmachine:~/Development/quantcrunch/demo$ make --version
GNU Make 3.81
Copyright (C) 2006 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.

This program built for x86_64-pc-linux-gnu

Error in FindSetPartitions

moderngpu git tag: V1.1
Ubuntu 14.04.5 LTS (GNU/Linux 3.13.0-106-generic x86_64)
gcc-4.8.real (Ubuntu 4.8.5-2ubuntu1~14.04.1) 4.8.5
Cuda compilation tools, release 8.0, V8.0.44


I noticed that FindSetPartitions() generates an error whenever the number of elements is bigger than one block. For example, using FindSetPartitions() on two sorted arrays A and B of lengths 591 and 584 respectively, I am getting:

0:       0 -2147483198    591

when the expected result should match that of MergePathPartitions():

0:       0    449    591

Minimum working example:

#include "kernels/sets.cuh"
#include "kernels/intervalmove.cuh"
#include <algorithm>

using namespace mgpu;

template<typename T>
__global__ void setZero( T *keys, const int total ) {
    for( int idx=threadIdx.x+blockIdx.x*blockDim.x; idx<total; idx+=blockDim.x*
            gridDim.x ) {
        keys[idx] = 0;
    }
}

template<typename T>
__global__ void lookRight( const T *keys, T *mark, const int total ) {
	for( int idx=threadIdx.x+blockIdx.x*blockDim.x; idx<total; idx+=blockDim.x*
			gridDim.x ) {
		if( keys[idx+1]!=keys[idx] ) mark[idx] = 1;
		else mark[idx] = 0;
	}
}

template<typename T>
__global__ void streamCompact( const T *mark, const T *scan, T *out, const int total ) {
	for( int idx=threadIdx.x+blockIdx.x*blockDim.x; idx<total; idx+=blockDim.x*
			gridDim.x )
		if( mark[idx] ) out[scan[idx]] = idx;
}

template<typename T>
void BenchmarkSetsPairs(int count, int numIt, CudaContext& context) {

	int aCount = count / 2;		 // number of A elements
	int bCount = count - aCount; // number of B elements

	MGPU_MEM(T) aKeys = context.SortRandom<T>(aCount, 0, count);
	MGPU_MEM(T) bKeys = context.SortRandom<T>(bCount, 0, count);
	MGPU_MEM(T) aValues = context.FillAscending<T>(aCount, 0, 1);
	MGPU_MEM(T) bValues = context.FillAscending<T>(bCount, 0, 1);

	MGPU_MEM(T) aMark = context.Malloc<T>(aCount);
	MGPU_MEM(T) bMark = context.Malloc<T>(bCount);
	MGPU_MEM(T) aScan = context.Malloc<T>(aCount);
	MGPU_MEM(T) bScan = context.Malloc<T>(bCount);

	const int NTHREADS = 128;
	const int NBLOCKS = 480;

    setZero<<<NBLOCKS,NTHREADS>>>( aMark->get(), aCount );
    setZero<<<NBLOCKS,NTHREADS>>>( bMark->get(), bCount );
    lookRight<<<NBLOCKS,NTHREADS>>>( aKeys->get(), aMark->get(), aCount-1 );
    lookRight<<<NBLOCKS,NTHREADS>>>( bKeys->get(), bMark->get(), bCount-1 );

	int aCountDupFree = 0;	
	int bCountDupFree = 0;
	// Only do scan for first aCount-1 since last one isn't defined	
	Scan<MgpuScanTypeExc>( aMark->get(), aCount-1, 0, plus<int>(), (int*)0, 
		&aCountDupFree, aScan->get(), context );
	Scan<MgpuScanTypeExc>( bMark->get(), bCount-1, 0, plus<int>(), (int*)0, 
		&bCountDupFree, bScan->get(), context );

	MGPU_MEM(T) aInd  = context.Malloc<T>(aCountDupFree);
	MGPU_MEM(T) bInd  = context.Malloc<T>(bCountDupFree);
	MGPU_MEM(T) aKeysUnique  = context.Malloc<T>(aCountDupFree);
	MGPU_MEM(T) bKeysUnique  = context.Malloc<T>(bCountDupFree);

	streamCompact<<<NBLOCKS,NTHREADS>>>( aMark->get(), aScan->get(), 
		aInd->get(), aCount );
	streamCompact<<<NBLOCKS,NTHREADS>>>( bMark->get(), bScan->get(), 
		bInd->get(), bCount );
	IntervalGather( aCountDupFree, aInd->get(), counting_iterator<int>(0),
		aCountDupFree, aKeys->get(), aKeysUnique->get(), context );
	IntervalGather( bCountDupFree, bInd->get(), counting_iterator<int>(0),
		bCountDupFree, bKeys->get(), bKeysUnique->get(), context );

	std::vector<T> aMarkHost, bMarkHost;
	std::vector<T> aScanHost, bScanHost;
	aMark->ToHost(aMarkHost);
	bMark->ToHost(bMarkHost);
	aScan->ToHost(aScanHost);
	bScan->ToHost(bScanHost);

	printf("A:\n");
	PrintArray(*aKeysUnique, "%6d", 10);
	printf("B:\n");
	PrintArray(*bKeysUnique, "%6d", 10);
	printf("\nA total:%d\nB total:%d\n", aCountDupFree, bCountDupFree);

	// Benchmark MGPU
    const int NT = 128;
    const int VT = 7;
    typedef LaunchBoxVT<NT, VT> Tuning;
    int2 launch = Tuning::GetLaunchParams(context);
    const int NV = launch.x * launch.y;
	typedef mgpu::less<typename std::iterator_traits<int*>::value_type> Comp;

    // BalancedPath search to establish partitions.
    MGPU_MEM(int) partitionsDevice = FindSetPartitions<false>(
		aKeysUnique->get(), aCountDupFree, bKeysUnique->get(), bCountDupFree, 
		NV, less<int>(), context);

	printf("FindSetPartitions:\n");
    PrintArray(*partitionsDevice, "%6d", 10 ); // numPart = 4

	MGPU_MEM(int) mergepartitionsDevice = MergePathPartitions<MgpuBoundsUpper>(
		aKeysUnique->get(), aCountDupFree, bKeysUnique->get(), bCountDupFree,
		NV, 0, less<int>(), context);

	printf("MergePathPartitions:\n");
    PrintArray(*mergepartitionsDevice, "%6d", 10 ); // numPart = 4

}

const int Tests[][2] = {
	{ 1500, 1 }
};
const int NumTests = sizeof(Tests) / sizeof(*Tests);

int main(int argc, char** argv) {
	ContextPtr context = CreateCudaDevice(argc, argv, true);

	typedef int T1;
	typedef int64 T2;

	for(int test = 0; test < 1; ++test)
		BenchmarkSetsPairs<T1>(Tests[test][0], Tests[test][1], *context);
	
	return 0;
}

Broken link in "Reduce and Scan"

The "Reduce and Scan" chapter tries to link to "NVIDIA's Mark Harris on Optimizing Parallel Reduction in CUDA" but uses

`https://moderngpu.github.io/http|//developer.download.nvidia.com/assets/cuda/files/reduction.pdf`

instead of

`http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf`

advice on tuning launch_params for transform_lbs

Hi, thanks for the nice library. I'm using transform_lbs in a scenario where every work item performs quite a lot of computation, and I'm unsure how to set the launch_params properly. Just reducing the values from the default launch_params_t<128, 11, 8> to something like launch_params_t<128, 3, 3> already increased performance quite a bit. However, I'm unsure how to set vt and vt0: should they be odd numbers? Should vt always be greater than vt0? Is correctness affected if these numbers are not set properly? Thanks for your help.
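
For context, here is how such an override looks in code, as I understand the tuning parameters: nt is the CTA size, vt the grain size (work items per thread), and vt0 the number of items loaded unconditionally; my understanding is that they affect performance rather than correctness, as long as vt0 <= vt. A sketch (the lambda body and the surrounding names count, segments, num_segments, and context are placeholders):

typedef mgpu::launch_params_t<128, 3, 3> tuning_t;

mgpu::transform_lbs<tuning_t>(
  [=] MGPU_DEVICE(int index, int seg, int rank) {
    // expensive per-work-item computation goes here.
  },
  count, segments, num_segments, context);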

Argument swap in ReduceByKey() in demos.cu

Hi,
In the DemoReduceByKey method in demos.cu, two of the arguments given to the ReduceByKey method are mixed up:

ReduceByKey(keysDevice->get(), valsDevice->get(), count, 0, mgpu::plus(), mgpu::equal_to(), destDevice->get(), keysDestDevice->get(), &numSegments, (int*)0, context);

should be

ReduceByKey(keysDevice->get(), valsDevice->get(), count, 0, mgpu::plus(), mgpu::equal_to(), keysDestDevice->get(), destDevice->get(), &numSegments, (int*)0, context);

in order for the output to be printed correctly.

This should also be fixed in the documentation: https://nvlabs.github.io/moderngpu/segreduce.html

Thanks!

Avoid cudaMalloc and cudaFree within the mergesort kernel

Hello,
I am testing both moderngpu's mergesort and CUB's radix sort for an implementation that I am currently working on.

I have benchmarked both and concluded that (at least for my data) the radix sort implementation from the CUB library is slightly faster, because it does not need several mallocs while sorting. Moderngpu's mergesort, on the other hand, appears to perform mallocs on every iteration; see below:

[profiler screenshot: repeated cudaMalloc/cudaFree calls around each mergesort]

The box highlight in red is composed of several cudaMalloc and cudaFree.
I was wondering if it is possible to get rid of these allocs, since this would make moderngpu faster than CUB in my application. For instance, in CUB I can simply pass the pointers with pre-allocated memory to the kernel launch.

Is there any way to do this in moderngpu?
Thank you for your time,
Esteban

---------- Edit: more information ----------
I timed one of my app's runs and found that the sys time is higher than with CUB (I assume due to the cudaMallocs; as it stands, mgpu takes approx. 10 to 30% less time overall):
Run with CUB:
real 3m26.482s
user 3m10.728s
sys 0m15.584s

Run with Mgpu:
real 3m34.427s
user 3m11.548s
sys 0m22.884s
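
Not a maintainer answer, but since the temporaries are allocated through the context object, one avenue is to subclass standard_context_t and intercept its allocation hooks with a pool. The sketch below assumes context_t exposes virtual alloc/free taking a memory_space_t, as declared in moderngpu/context.hxx; verify the exact signatures there before relying on this:

#include <moderngpu/context.hxx>
#include <map>

// Sketch: cache freed device blocks and hand them back on the next
// fitting request instead of calling cudaMalloc/cudaFree every time.
// A real implementation must also release pool_ blocks on destruction.
class caching_context_t : public mgpu::standard_context_t {
  std::map<void*, size_t> live_;       // sizes of outstanding blocks
  std::multimap<size_t, void*> pool_;  // cached blocks keyed by size

public:
  virtual void* alloc(size_t size, mgpu::memory_space_t space) override {
    if(mgpu::memory_space_device == space) {
      auto it = pool_.lower_bound(size);
      if(it != pool_.end()) {          // reuse a cached block.
        void* p = it->second;
        live_[p] = it->first;
        pool_.erase(it);
        return p;
      }
    }
    void* p = mgpu::standard_context_t::alloc(size, space);
    if(mgpu::memory_space_device == space) live_[p] = size;
    return p;
  }

  virtual void free(void* p, mgpu::memory_space_t space) override {
    auto it = live_.find(p);
    if(mgpu::memory_space_device == space && it != live_.end()) {
      pool_.insert(std::make_pair(it->second, p));  // keep for reuse.
      live_.erase(it);
    } else {
      mgpu::standard_context_t::free(p, space);
    }
  }
};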
