leofang / cupy

This project is forked from cupy/cupy.

NumPy-like API accelerated with CUDA

Home Page: https://cupy.chainer.org

License: MIT License

Languages: PowerShell 0.08%, Batchfile 0.01%, Python 58.68%, C++ 10.01%, C 8.89%, Cuda 1.60%, Dockerfile 0.40%, Shell 0.25%, Cython 20.07%

cupy's People

Contributors

anaruse, andfoy, aonotas, asi1024, asmeurer, beam2d, cjnolet, codyseto, dahlia-chehata, delta2323, emcastillo, fukatani, grlee77, gwtnb, hakuyume, hvy, khushi-411, kmaehashi, leofang, mergify[bot], mitmul, niboshi, okuta, piyush-555, rezoo, takagi, toslunar, unnonouno, yoshikawamasashi, yuyu2172


Forkers

mattkinsey

cupy's Issues

blahblahblah

CuPy master

routine: argwhere

| shape | density | NumPy | CuPy |
| --- | --- | --- | --- |
| (1000000,) | 0.1 | 2.117 ms | 0.205 ms |
| (1000000,) | 0.5 | 3.677 ms | 0.205 ms |
| (1000, 1000) | 0.1 | 2.963 ms | 0.214 ms |
| (1000, 1000) | 0.5 | 7.505 ms | 0.222 ms |
| (10000000,) | 0.1 | 15.239 ms | 0.750 ms |
| (10000000,) | 0.5 | 48.883 ms | 0.798 ms |
| (3162, 3163) | 0.1 | 30.753 ms | 0.802 ms |
| (3162, 3163) | 0.5 | 96.284 ms | 0.818 ms |
| (100000000,) | 0.1 | 174.880 ms | 5.406 ms |
| (100000000,) | 0.5 | 488.649 ms | 5.329 ms |
| (10000, 10000) | 0.1 | 342.146 ms | 5.852 ms |
| (10000, 10000) | 0.5 | 961.539 ms | 5.976 ms |

CuPy CUB

| shape | density | NumPy | CuPy |
| --- | --- | --- | --- |
| (1000000,) | 0.1 | 1.638 ms | 0.199 ms |
| (1000000,) | 0.5 | 3.664 ms | 0.197 ms |
| (1000, 1000) | 0.1 | 2.970 ms | 0.220 ms |
| (1000, 1000) | 0.5 | 7.551 ms | 0.219 ms |
| (10000000,) | 0.1 | 15.306 ms | 0.684 ms |
| (10000000,) | 0.5 | 48.676 ms | 0.692 ms |
| (3162, 3163) | 0.1 | 30.808 ms | 0.740 ms |
| (3162, 3163) | 0.5 | 95.971 ms | 0.756 ms |
| (100000000,) | 0.1 | 175.715 ms | 4.786 ms |
| (100000000,) | 0.5 | 486.734 ms | 4.760 ms |
| (10000, 10000) | 0.1 | 343.035 ms | 5.189 ms |
| (10000, 10000) | 0.5 | 958.509 ms | 5.416 ms |

New scan

| shape | density | NumPy | CuPy |
| --- | --- | --- | --- |
| (1000000,) | 0.1 | 1.489 ms | 0.204 ms |
| (1000000,) | 0.5 | 3.661 ms | 0.244 ms |
| (1000, 1000) | 0.1 | 3.018 ms | 0.221 ms |
| (1000, 1000) | 0.5 | 7.525 ms | 0.224 ms |
| (10000000,) | 0.1 | 15.071 ms | 0.699 ms |
| (10000000,) | 0.5 | 48.683 ms | 0.705 ms |
| (3162, 3163) | 0.1 | 30.551 ms | 0.763 ms |
| (3162, 3163) | 0.5 | 96.009 ms | 0.778 ms |
| (100000000,) | 0.1 | 174.189 ms | 4.941 ms |
| (100000000,) | 0.5 | 486.482 ms | 4.997 ms |
| (10000, 10000) | 0.1 | 340.625 ms | 5.340 ms |
| (10000, 10000) | 0.5 | 958.917 ms | 5.513 ms |
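
For reference, a minimal sketch (assumed, not taken from these notes) of how a single (shape, density) entry above can be measured; the boolean-mask construction is only my guess at how "density" is realized:

import time
import numpy as np
import cupy as cp
from cupyx.time import repeat

shape, density = (1000, 1000), 0.1
mask_cpu = np.random.random(shape) < density   # ~10% of the elements are nonzero
mask_gpu = cp.asarray(mask_cpu)

# NumPy (CPU) timing of a single call
t0 = time.perf_counter()
np.argwhere(mask_cpu)
print('NumPy: {:.3f} ms'.format((time.perf_counter() - t0) * 1e3))

# CuPy (GPU) timing, averaged over repeated, synchronized runs
print(repeat(cp.argwhere, (mask_gpu,), n=100))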

Investigating warp-based scan kernel for ROCm

Attempting to make the warp-based scan kernel (cupy#4315) work on ROCm/HIP. Changes are in https://github.com/leofang/cupy/tree/improve_scan_1. Only int32 and float32 have the required warp intrinsics.

CUB is enabled by setting CUPY_ACCELERATORS=cub.
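
For reference, a minimal sketch (not from the original notes) of turning this on from Python; the variable has to be set before cupy is imported:

import os
os.environ['CUPY_ACCELERATORS'] = 'cub'  # must be set before importing cupy

import cupy as cp

x = cp.random.random(10_000_000)
print(cp.cumsum(x)[-1])  # cumsum (scan) is one routine the CUB path can accelerate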

On CuPy master + ROCm 3.5.0 + Radeon VII:

| size | dtype | NumPy | CuPy | CuPy (cupy#4315) | CuPy (CUB) | note |
| --- | --- | --- | --- | --- | --- | --- |
| 1000000 | int32 | 1.395 ms | 0.154 ms | 0.151 ms | 0.105 ms | HIP has warp support |
| 1000000 | int64 | 1.406 ms | 0.174 ms | 0.175 ms | 0.121 ms | |
| 1000000 | float32 | 2.187 ms | 0.155 ms | 0.148 ms | 0.107 ms | HIP has warp support |
| 1000000 | float64 | 2.189 ms | 0.177 ms | 0.177 ms | 0.121 ms | |
| 10000000 | int32 | 17.922 ms | 0.618 ms | 0.605 ms | 0.365 ms | HIP has warp support |
| 10000000 | int64 | 23.124 ms | 0.810 ms | 0.815 ms | 0.480 ms | |
| 10000000 | float32 | 25.759 ms | 0.621 ms | 0.608 ms | 0.368 ms | HIP has warp support |
| 10000000 | float64 | 30.172 ms | 0.812 ms | 0.818 ms | 0.485 ms | |
| 100000000 | int32 | 179.647 ms | 5.289 ms | 4.999 ms | 2.842 ms | HIP has warp support |
| 100000000 | int64 | 231.564 ms | 7.265 ms | 7.238 ms | 4.094 ms | |
| 100000000 | float32 | 257.582 ms | 5.305 ms | 5.004 ms | 2.843 ms | HIP has warp support; bad accuracy |
| 100000000 | float64 | 301.601 ms | 7.304 ms | 7.293 ms | 4.131 ms | |

Looks like hipCUB/rocPRIM wins unconditionally.

Question on DtoD/peer copy

Hi @maxpkatz, I hope you don’t mind me bugging you again 😅

Just one quick question (I hope) to clarify the behavior: when using cudaMemcpyAsync or cudaMemcpyPeerAsync to copy between two devices, on which device should the stream argument reside, the source or the destination? I find the documentation unclear on this, and after some tests I suspect it is the source. Am I right?

Thanks!
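
For reference, a minimal sketch (my own test outline, not part of the original question) of exercising the peer copy from CuPy, assuming cupy.cuda.runtime.memcpyPeerAsync; the comparison would swap which device the stream is created on:

import cupy as cp

with cp.cuda.Device(0):
    src = cp.arange(1 << 20, dtype=cp.float32)   # source array on device 0
with cp.cuda.Device(1):
    dst = cp.empty_like(src)                     # destination array on device 1

# Variant A: the stream is created on the source device (device 0).
with cp.cuda.Device(0):
    stream = cp.cuda.Stream(non_blocking=True)
    cp.cuda.runtime.memcpyPeerAsync(dst.data.ptr, 1, src.data.ptr, 0,
                                    src.nbytes, stream.ptr)
    stream.synchronize()

# Variant B would create the stream under cp.cuda.Device(1) (the destination)
# and issue the same call; comparing the two is the point of the question.
assert bool((cp.asnumpy(dst) == cp.asnumpy(src)).all())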

thoughts

  1. if backend='nvcc' is set, the compiler should first check whether nvcc is actually available (see the sketch after this list). This is for the extreme corner case that cudatoolkit from conda-forge does not come with nvcc.
  2. CUB-backed generic reduction kernel (using BlockReduce)
  3. fix data transfer serialization in the multi-GPU fft I recently added (related: #8)
  4. investigate if fftn can be done by looping over multi-GPU fft
  5. generic multi-GPU fftn
  6. test shutdown bug by running subprocess with output captured (should be null).
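
A minimal sketch for item 1 (the helper name is illustrative, not CuPy's actual API):

import shutil

def nvcc_available() -> bool:
    """Return True if an nvcc executable can be found on PATH."""
    return shutil.which('nvcc') is not None

if not nvcc_available():
    raise RuntimeError(
        "backend='nvcc' was requested but no nvcc executable was found; "
        "note that cudatoolkit from conda-forge does not ship nvcc.")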

To learn:

  1. sparse matrix
  2. linalg / SVD
  3. string template based kernel generation technique

[WIP] CUB-backed generic reduction

Tested at commit 6843030 of the cupy_reduce_cub_backend2 branch (with the block size locally changed to 1024).

Test script:

import sys
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

import cupy as cp
from cupyx.time import repeat


shape = (256, 512, 512)
a = cp.random.random(shape)
# Reductions that have a CUB device-wide implementation in CuPy.
CUB_supported = ['sum', 'prod', 'min', 'max', 'argmin', 'argmax']
# The remaining reductions only go through the CUB block-level path.
REST = ['amin', 'amax', 'nanmin', 'nanmax', 'nanargmin', 'nanargmax',
        'mean', 'nanmean', 'var', 'nanvar', 'nansum', 'nanprod',
        'all', 'any', 'count_nonzero']

for reduce_func in CUB_supported + REST:
    for axis in [(2,), (1, 2), (0, 1, 2)]:
        func = getattr(cp, reduce_func)
        print("testing axis = ", axis, '...')

        # Baseline: CuPy's original reduction kernel.
        cp.cuda.cub_enabled = False
        cp.core.cub_block_reduction_enabled = False
        data = repeat(func, (a, axis), n=100)
        results = [data._to_str_per_item('GPU', data.gpu_times)]
        print('{:<10s} (old kernel):{}'.format(reduce_func, ' '.join(results)))
        b = func(a, axis)

        # CUB device-wide reduction (only available for CUB_supported).
        if reduce_func in CUB_supported:
            cp.cuda.cub_enabled = True
            cp.core.cub_block_reduction_enabled = False
            data = repeat(func, (a, axis), n=100)
            results = [data._to_str_per_item('GPU', data.gpu_times)]
            print('{:<10s} (CUB device):{}'.format(reduce_func, ' '.join(results)))
            c = func(a, axis)
        else:
            print('{:<10s} (CUB device):{}'.format(reduce_func, '    (CUB device-wide reduction not available)'))

        # CUB block-level (BlockReduce) reduction: the new implementation.
        cp.cuda.cub_enabled = False
        cp.core.cub_block_reduction_enabled = True
        data = repeat(func, (a, axis), n=100)
        results = [data._to_str_per_item('GPU', data.gpu_times)]
        print('{:<10s} (CUB blocks):{}'.format(reduce_func, ' '.join(results)))
        d = func(a, axis)

        # Verify that the new path agrees with the baseline kernel.
        try:
            assert cp.allclose(b, d)
        except AssertionError:
            print("Results do not match! (function: {}, axis: {})".format(reduce_func, axis), file=sys.stderr)
            raise
        finally:
            print()
    print('----------------------------------------------------------------------')

Results (P100 + CUDA 9.2); "CUB blocks" is the new implementation:

testing axis =  (2,) ...
sum        (old kernel):    GPU: 4981.387 us   +/-208.941 (min: 4831.776 / max: 5356.704) us
sum        (CUB device):    GPU: 1044.733 us   +/- 1.323 (min: 1042.208 / max: 1052.096) us
sum        (CUB blocks):    GPU:  974.125 us   +/-10.610 (min:  968.096 / max: 1039.008) us

testing axis =  (1, 2) ...
sum        (old kernel):    GPU: 1234.895 us   +/-51.552 (min: 1201.792 / max: 1346.144) us
sum        (CUB device):    GPU: 1032.296 us   +/- 2.303 (min: 1028.320 / max: 1041.312) us
sum        (CUB blocks):    GPU: 1044.719 us   +/- 2.730 (min: 1040.192 / max: 1052.864) us

testing axis =  (0, 1, 2) ...
sum        (old kernel):    GPU:60725.747 us   +/- 8.761 (min:60701.824 / max:60744.926) us
sum        (CUB device):    GPU:  997.767 us   +/- 3.168 (min:  991.648 / max: 1013.632) us
sum        (CUB blocks):    GPU:46982.573 us   +/- 8.591 (min:46961.792 / max:47007.584) us

----------------------------------------------------------------------
testing axis =  (2,) ...
prod       (old kernel):    GPU: 4814.530 us   +/- 2.763 (min: 4812.736 / max: 4829.632) us
prod       (CUB device):    GPU: 1010.767 us   +/- 2.397 (min: 1008.352 / max: 1029.568) us
prod       (CUB blocks):    GPU:  953.065 us   +/- 1.840 (min:  951.136 / max:  967.392) us

testing axis =  (1, 2) ...
prod       (old kernel):    GPU: 1224.603 us   +/-56.940 (min: 1182.944 / max: 1326.528) us
prod       (CUB device):    GPU:  997.257 us   +/- 3.251 (min:  993.344 / max: 1016.352) us
prod       (CUB blocks):    GPU: 1027.258 us   +/- 2.081 (min: 1024.032 / max: 1040.192) us

testing axis =  (0, 1, 2) ...
prod       (old kernel):    GPU:60724.508 us   +/- 9.653 (min:60708.256 / max:60786.015) us
prod       (CUB device):    GPU:  998.307 us   +/- 2.376 (min:  994.720 / max: 1009.760) us
prod       (CUB blocks):    GPU:46981.408 us   +/- 9.336 (min:46954.594 / max:47006.496) us

----------------------------------------------------------------------
testing axis =  (2,) ...
min        (old kernel):    GPU: 7280.696 us   +/- 3.647 (min: 7274.144 / max: 7300.736) us
min        (CUB device):    GPU: 1167.163 us   +/- 3.714 (min: 1161.408 / max: 1187.968) us
min        (CUB blocks):    GPU: 1052.274 us   +/- 1.791 (min: 1050.368 / max: 1066.528) us

testing axis =  (1, 2) ...
min        (old kernel):    GPU: 1542.100 us   +/- 4.306 (min: 1537.056 / max: 1574.688) us
min        (CUB device):    GPU: 1002.540 us   +/- 4.994 (min:  997.568 / max: 1035.200) us
min        (CUB blocks):    GPU: 1536.580 us   +/- 2.408 (min: 1531.008 / max: 1551.200) us

testing axis =  (0, 1, 2) ...
min        (old kernel):    GPU:68409.857 us   +/-28.736 (min:68342.529 / max:68475.166) us
min        (CUB device):    GPU: 1135.014 us   +/- 3.722 (min: 1129.376 / max: 1153.120) us
min        (CUB blocks):    GPU:100779.564 us   +/-49.208 (min:100715.485 / max:100850.014) us

----------------------------------------------------------------------
testing axis =  (2,) ...
max        (old kernel):    GPU: 7279.699 us   +/- 3.361 (min: 7273.632 / max: 7298.848) us
max        (CUB device):    GPU: 1167.538 us   +/- 5.012 (min: 1160.928 / max: 1193.696) us
max        (CUB blocks):    GPU: 1052.600 us   +/- 1.783 (min: 1050.784 / max: 1065.824) us

testing axis =  (1, 2) ...
max        (old kernel):    GPU: 1541.445 us   +/- 2.969 (min: 1536.320 / max: 1554.592) us
max        (CUB device):    GPU: 1002.808 us   +/- 5.097 (min:  997.920 / max: 1024.800) us
max        (CUB blocks):    GPU: 1538.108 us   +/- 3.368 (min: 1532.960 / max: 1557.952) us

testing axis =  (0, 1, 2) ...
max        (old kernel):    GPU:68411.174 us   +/-26.093 (min:68351.715 / max:68464.447) us
max        (CUB device):    GPU: 1135.247 us   +/- 3.801 (min: 1128.288 / max: 1152.608) us
max        (CUB blocks):    GPU:101066.898 us   +/-24.490 (min:101019.295 / max:101112.221) us

----------------------------------------------------------------------
testing axis =  (2,) ...
argmin     (old kernel):    GPU: 7284.797 us   +/- 2.291 (min: 7280.800 / max: 7292.480) us
argmin     (CUB device):    GPU: 7291.742 us   +/- 2.886 (min: 7286.016 / max: 7306.240) us
argmin     (CUB blocks):    GPU: 1901.183 us   +/- 4.003 (min: 1898.272 / max: 1919.168) us

testing axis =  (1, 2) ...
argmin     (old kernel):    GPU: 1665.333 us   +/- 2.907 (min: 1659.968 / max: 1676.032) us
argmin     (CUB device):    GPU: 1672.449 us   +/- 4.058 (min: 1664.736 / max: 1693.280) us
argmin     (CUB blocks):    GPU: 3336.875 us   +/- 5.334 (min: 3329.856 / max: 3371.680) us

testing axis =  (0, 1, 2) ...
argmin     (old kernel):    GPU:80236.562 us   +/-16.019 (min:80201.378 / max:80276.733) us
argmin     (CUB device):    GPU:  977.714 us   +/- 2.309 (min:  973.472 / max:  992.192) us
argmin     (CUB blocks):    GPU:175400.849 us   +/-16.974 (min:175360.123 / max:175450.790) us

----------------------------------------------------------------------
testing axis =  (2,) ...
argmax     (old kernel):    GPU: 7284.120 us   +/- 2.183 (min: 7279.680 / max: 7290.464) us
argmax     (CUB device):    GPU: 7291.568 us   +/- 2.834 (min: 7286.784 / max: 7306.656) us
argmax     (CUB blocks):    GPU: 1900.549 us   +/- 2.215 (min: 1897.888 / max: 1914.624) us

testing axis =  (1, 2) ...
argmax     (old kernel):    GPU: 1665.754 us   +/- 3.971 (min: 1660.224 / max: 1685.280) us
argmax     (CUB device):    GPU: 1672.875 us   +/- 3.158 (min: 1665.984 / max: 1681.184) us
argmax     (CUB blocks):    GPU: 3333.666 us   +/- 5.844 (min: 3324.992 / max: 3363.712) us

testing axis =  (0, 1, 2) ...
argmax     (old kernel):    GPU:80241.476 us   +/-20.790 (min:80196.579 / max:80329.826) us
argmax     (CUB device):    GPU:  978.660 us   +/- 1.914 (min:  975.168 / max:  987.360) us
argmax     (CUB blocks):    GPU:174229.262 us   +/-13.119 (min:174202.713 / max:174262.527) us

----------------------------------------------------------------------
testing axis =  (2,) ...
amin       (old kernel):    GPU: 7278.928 us   +/- 2.520 (min: 7274.656 / max: 7286.016) us
amin       (CUB device):    (CUB device-wide reduction not available)
amin       (CUB blocks):    GPU: 1052.410 us   +/- 1.762 (min: 1050.688 / max: 1066.400) us

testing axis =  (1, 2) ...
amin       (old kernel):    GPU: 1541.058 us   +/- 2.964 (min: 1537.760 / max: 1558.752) us
amin       (CUB device):    (CUB device-wide reduction not available)
amin       (CUB blocks):    GPU: 1536.903 us   +/- 2.836 (min: 1532.384 / max: 1549.696) us

testing axis =  (0, 1, 2) ...
amin       (old kernel):    GPU:68411.164 us   +/-30.626 (min:68315.712 / max:68490.082) us
amin       (CUB device):    (CUB device-wide reduction not available)
amin       (CUB blocks):    GPU:100777.386 us   +/-50.680 (min:100704.124 / max:100849.632) us

----------------------------------------------------------------------
testing axis =  (2,) ...
amax       (old kernel):    GPU: 7279.396 us   +/- 3.273 (min: 7272.288 / max: 7294.400) us
amax       (CUB device):    (CUB device-wide reduction not available)
amax       (CUB blocks):    GPU: 1052.501 us   +/- 1.848 (min: 1050.272 / max: 1065.088) us

testing axis =  (1, 2) ...
amax       (old kernel):    GPU: 1541.122 us   +/- 3.032 (min: 1535.328 / max: 1555.648) us
amax       (CUB device):    (CUB device-wide reduction not available)
amax       (CUB blocks):    GPU: 1538.018 us   +/- 3.088 (min: 1534.112 / max: 1552.320) us

testing axis =  (0, 1, 2) ...
amax       (old kernel):    GPU:68409.406 us   +/-29.267 (min:68310.623 / max:68465.630) us
amax       (CUB device):    (CUB device-wide reduction not available)
amax       (CUB blocks):    GPU:101066.076 us   +/-23.849 (min:101021.507 / max:101114.883) us

----------------------------------------------------------------------
testing axis =  (2,) ...
nanmin     (old kernel):    GPU: 6995.210 us   +/- 4.527 (min: 6985.952 / max: 7012.544) us
nanmin     (CUB device):    (CUB device-wide reduction not available)
nanmin     (CUB blocks):    GPU: 1076.651 us   +/-13.292 (min: 1055.840 / max: 1118.304) us

testing axis =  (1, 2) ...
nanmin     (old kernel):    GPU: 1541.821 us   +/-28.587 (min: 1446.752 / max: 1583.264) us
nanmin     (CUB device):    (CUB device-wide reduction not available)
nanmin     (CUB blocks):    GPU: 1272.550 us   +/-39.880 (min: 1203.328 / max: 1315.168) us

testing axis =  (0, 1, 2) ...
nanmin     (old kernel):    GPU:62372.098 us   +/- 7.051 (min:62359.039 / max:62395.809) us
nanmin     (CUB device):    (CUB device-wide reduction not available)
nanmin     (CUB blocks):    GPU:73762.741 us   +/- 6.750 (min:73742.653 / max:73780.479) us

----------------------------------------------------------------------
testing axis =  (2,) ...
nanmax     (old kernel):    GPU: 6994.817 us   +/- 6.036 (min: 6985.376 / max: 7023.648) us
nanmax     (CUB device):    (CUB device-wide reduction not available)
nanmax     (CUB blocks):    GPU: 1049.083 us   +/- 5.312 (min: 1040.704 / max: 1071.232) us

testing axis =  (1, 2) ...
nanmax     (old kernel):    GPU: 1545.598 us   +/-19.431 (min: 1447.776 / max: 1568.640) us
nanmax     (CUB device):    (CUB device-wide reduction not available)
nanmax     (CUB blocks):    GPU: 1299.933 us   +/-45.486 (min: 1221.440 / max: 1371.712) us

testing axis =  (0, 1, 2) ...
nanmax     (old kernel):    GPU:62374.201 us   +/- 8.502 (min:62355.873 / max:62394.623) us
nanmax     (CUB device):    (CUB device-wide reduction not available)
nanmax     (CUB blocks):    GPU:73761.331 us   +/- 5.107 (min:73749.283 / max:73777.054) us

----------------------------------------------------------------------
testing axis =  (2,) ...
nanargmin  (old kernel):    GPU: 7521.018 us   +/- 8.955 (min: 7497.728 / max: 7550.592) us
nanargmin  (CUB device):    (CUB device-wide reduction not available)
nanargmin  (CUB blocks):    GPU: 1973.085 us   +/-12.519 (min: 1956.928 / max: 2008.640) us

testing axis =  (1, 2) ...
nanargmin  (old kernel):    GPU: 1849.730 us   +/- 3.769 (min: 1842.944 / max: 1871.520) us
nanargmin  (CUB device):    (CUB device-wide reduction not available)
nanargmin  (CUB blocks):    GPU: 3707.225 us   +/-17.858 (min: 3687.136 / max: 3776.480) us

testing axis =  (0, 1, 2) ...
nanargmin  (old kernel):    GPU:81803.034 us   +/-14.609 (min:81770.882 / max:81840.797) us
nanargmin  (CUB device):    (CUB device-wide reduction not available)
nanargmin  (CUB blocks):    GPU:190295.268 us   +/-18.013 (min:190263.840 / max:190389.053) us

----------------------------------------------------------------------
testing axis =  (2,) ...
nanargmax  (old kernel):    GPU: 7520.069 us   +/- 8.667 (min: 7498.752 / max: 7543.712) us
nanargmax  (CUB device):    (CUB device-wide reduction not available)
nanargmax  (CUB blocks):    GPU: 1971.927 us   +/-11.871 (min: 1956.960 / max: 2012.192) us

testing axis =  (1, 2) ...
nanargmax  (old kernel):    GPU: 1856.170 us   +/- 3.025 (min: 1851.296 / max: 1871.520) us
nanargmax  (CUB device):    (CUB device-wide reduction not available)
nanargmax  (CUB blocks):    GPU: 3695.171 us   +/-12.590 (min: 3679.200 / max: 3732.896) us

testing axis =  (0, 1, 2) ...
nanargmax  (old kernel):    GPU:81804.032 us   +/-15.481 (min:81766.975 / max:81843.262) us
nanargmax  (CUB device):    (CUB device-wide reduction not available)
nanargmax  (CUB blocks):    GPU:190926.042 us   +/-12.581 (min:190897.018 / max:190953.796) us

----------------------------------------------------------------------
testing axis =  (2,) ...
mean       (old kernel):    GPU: 4919.463 us   +/- 2.347 (min: 4917.248 / max: 4932.576) us
mean       (CUB device):    (CUB device-wide reduction not available)
mean       (CUB blocks):    GPU:  984.885 us   +/-15.563 (min:  957.888 / max: 1039.520) us

testing axis =  (1, 2) ...
mean       (old kernel):    GPU: 1250.211 us   +/-61.111 (min: 1193.696 / max: 1351.392) us
mean       (CUB device):    (CUB device-wide reduction not available)
mean       (CUB blocks):    GPU: 1069.789 us   +/-17.287 (min: 1035.040 / max: 1138.208) us

testing axis =  (0, 1, 2) ...
mean       (old kernel):    GPU:60708.971 us   +/-10.122 (min:60692.577 / max:60769.440) us
mean       (CUB device):    (CUB device-wide reduction not available)
mean       (CUB blocks):    GPU:46983.943 us   +/-10.468 (min:46957.314 / max:47013.695) us

----------------------------------------------------------------------
testing axis =  (2,) ...
nanmean    (old kernel):    GPU: 5708.619 us   +/- 2.875 (min: 5705.440 / max: 5726.112) us
nanmean    (CUB device):    (CUB device-wide reduction not available)
nanmean    (CUB blocks):    GPU:  992.895 us   +/-14.937 (min:  959.744 / max: 1035.840) us

testing axis =  (1, 2) ...
nanmean    (old kernel):    GPU: 1553.189 us   +/- 4.186 (min: 1545.536 / max: 1566.496) us
nanmean    (CUB device):    (CUB device-wide reduction not available)
nanmean    (CUB blocks):    GPU: 1158.552 us   +/-15.002 (min: 1134.752 / max: 1205.888) us

testing axis =  (0, 1, 2) ...
nanmean    (old kernel):    GPU:66023.918 us   +/-35.811 (min:65941.635 / max:66112.961) us
nanmean    (CUB device):    (CUB device-wide reduction not available)
nanmean    (CUB blocks):    GPU:49554.622 us   +/- 7.062 (min:49534.206 / max:49575.775) us

----------------------------------------------------------------------
testing axis =  (2,) ...
var        (old kernel):    GPU: 6640.631 us   +/- 7.249 (min: 6623.936 / max: 6664.608) us
var        (CUB device):    (CUB device-wide reduction not available)
var        (CUB blocks):    GPU: 2680.171 us   +/- 4.932 (min: 2669.120 / max: 2692.480) us

testing axis =  (1, 2) ...
var        (old kernel):    GPU:15180.429 us   +/-141.299 (min:14959.104 / max:15639.168) us
var        (CUB device):    (CUB device-wide reduction not available)
var        (CUB blocks):    GPU:14967.292 us   +/-144.692 (min:14789.376 / max:15738.304) us

testing axis =  (0, 1, 2) ...
var        (old kernel):    GPU:1871272.574 us   +/-304.004 (min:1870493.774 / max:1871963.867) us
var        (CUB device):    (CUB device-wide reduction not available)
var        (CUB blocks):    GPU:114163.430 us   +/-17.509 (min:114111.069 / max:114227.264) us

----------------------------------------------------------------------
testing axis =  (2,) ...
nanvar     (old kernel):    GPU:11511.579 us   +/- 2.125 (min:11508.064 / max:11520.352) us
nanvar     (CUB device):    (CUB device-wide reduction not available)
nanvar     (CUB blocks):    GPU: 4742.906 us   +/- 2.971 (min: 4738.720 / max: 4759.840) us

testing axis =  (1, 2) ...
nanvar     (old kernel):    GPU:27261.787 us   +/-14.532 (min:27224.735 / max:27303.360) us
nanvar     (CUB device):    (CUB device-wide reduction not available)
nanvar     (CUB blocks):    GPU:26852.691 us   +/-16.575 (min:26819.040 / max:26892.769) us

testing axis =  (0, 1, 2) ...
nanvar     (old kernel):    GPU:4169472.632 us   +/-66.999 (min:4169318.848 / max:4169709.961) us
nanvar     (CUB device):    (CUB device-wide reduction not available)
nanvar     (CUB blocks):    GPU:278711.689 us   +/-40.308 (min:278618.988 / max:278808.868) us

----------------------------------------------------------------------
testing axis =  (2,) ...
nansum     (old kernel):    GPU: 4344.424 us   +/- 2.046 (min: 4342.048 / max: 4357.792) us
nansum     (CUB device):    (CUB device-wide reduction not available)
nansum     (CUB blocks):    GPU:  953.723 us   +/- 1.024 (min:  952.000 / max:  958.880) us

testing axis =  (1, 2) ...
nansum     (old kernel):    GPU: 1246.913 us   +/- 2.604 (min: 1242.432 / max: 1259.712) us
nansum     (CUB device):    (CUB device-wide reduction not available)
nansum     (CUB blocks):    GPU: 1053.856 us   +/- 3.706 (min: 1049.184 / max: 1069.664) us

testing axis =  (0, 1, 2) ...
nansum     (old kernel):    GPU:63447.723 us   +/- 6.865 (min:63433.632 / max:63474.049) us
nansum     (CUB device):    (CUB device-wide reduction not available)
nansum     (CUB blocks):    GPU:47387.075 us   +/- 2.659 (min:47384.960 / max:47401.279) us

----------------------------------------------------------------------
testing axis =  (2,) ...
nanprod    (old kernel):    GPU: 4345.197 us   +/- 2.536 (min: 4342.880 / max: 4358.656) us
nanprod    (CUB device):    (CUB device-wide reduction not available)
nanprod    (CUB blocks):    GPU:  993.034 us   +/-17.009 (min:  961.696 / max: 1055.584) us

testing axis =  (1, 2) ...
nanprod    (old kernel):    GPU: 1252.908 us   +/- 5.268 (min: 1245.056 / max: 1266.624) us
nanprod    (CUB device):    (CUB device-wide reduction not available)
nanprod    (CUB blocks):    GPU: 1088.906 us   +/-14.730 (min: 1069.088 / max: 1148.384) us

testing axis =  (0, 1, 2) ...
nanprod    (old kernel):    GPU:63449.758 us   +/- 6.428 (min:63435.360 / max:63466.496) us
nanprod    (CUB device):    (CUB device-wide reduction not available)
nanprod    (CUB blocks):    GPU:47388.084 us   +/- 8.357 (min:47383.457 / max:47465.439) us

----------------------------------------------------------------------
testing axis =  (2,) ...
all        (old kernel):    GPU: 4558.527 us   +/- 8.207 (min: 4552.864 / max: 4594.304) us
all        (CUB device):    (CUB device-wide reduction not available)
all        (CUB blocks):    GPU:  987.198 us   +/-16.190 (min:  965.152 / max: 1054.752) us

testing axis =  (1, 2) ...
all        (old kernel):    GPU: 1218.613 us   +/-14.359 (min: 1210.784 / max: 1357.472) us
all        (CUB device):    (CUB device-wide reduction not available)
all        (CUB blocks):    GPU: 1054.616 us   +/-15.980 (min: 1030.752 / max: 1114.624) us

testing axis =  (0, 1, 2) ...
all        (old kernel):    GPU:63045.030 us   +/- 3.270 (min:63037.983 / max:63056.030) us
all        (CUB device):    (CUB device-wide reduction not available)
all        (CUB blocks):    GPU:44248.783 us   +/- 8.405 (min:44229.759 / max:44272.224) us

----------------------------------------------------------------------
testing axis =  (2,) ...
any        (old kernel):    GPU: 4551.490 us   +/- 2.726 (min: 4548.736 / max: 4565.888) us
any        (CUB device):    (CUB device-wide reduction not available)
any        (CUB blocks):    GPU:  980.970 us   +/-14.647 (min:  962.368 / max: 1031.104) us

testing axis =  (1, 2) ...
any        (old kernel):    GPU: 1227.028 us   +/- 6.295 (min: 1216.128 / max: 1253.536) us
any        (CUB device):    (CUB device-wide reduction not available)
any        (CUB blocks):    GPU: 1064.959 us   +/-13.211 (min: 1047.040 / max: 1118.752) us

testing axis =  (0, 1, 2) ...
any        (old kernel):    GPU:63815.470 us   +/-11.096 (min:63790.302 / max:63843.521) us
any        (CUB device):    (CUB device-wide reduction not available)
any        (CUB blocks):    GPU:48141.098 us   +/- 6.768 (min:48132.000 / max:48183.426) us

----------------------------------------------------------------------
testing axis =  (2,) ...
count_nonzero (old kernel):    GPU: 4333.822 us   +/- 7.074 (min: 4328.640 / max: 4365.376) us
count_nonzero (CUB device):    (CUB device-wide reduction not available)
count_nonzero (CUB blocks):    GPU:  977.654 us   +/-11.242 (min:  962.656 / max: 1034.304) us

testing axis =  (1, 2) ...
count_nonzero (old kernel):    GPU: 1238.971 us   +/- 4.179 (min: 1232.992 / max: 1253.280) us
count_nonzero (CUB device):    (CUB device-wide reduction not available)
count_nonzero (CUB blocks):    GPU: 1060.503 us   +/-13.000 (min: 1043.104 / max: 1112.512) us

testing axis =  (0, 1, 2) ...
count_nonzero (old kernel):    GPU:64118.497 us   +/-20.636 (min:64062.462 / max:64168.701) us
count_nonzero (CUB device):    (CUB device-wide reduction not available)
count_nonzero (CUB blocks):    GPU:47377.478 us   +/- 3.837 (min:47369.022 / max:47395.615) us

----------------------------------------------------------------------
