leofang / cupy
This project forked from cupy/cupy
NumPy-like API accelerated with CUDA
Home Page: https://cupy.chainer.org
License: MIT License
Benchmark of the argwhere routine (see cupy/issues/1889). Timings below compare NumPy and CuPy on boolean arrays of the given shape and nonzero density.

CuPy master
shape | density | NumPy | CuPy |
---|---|---|---|
(1000000,) | 0.1 | 2.117 ms | 0.205 ms |
(1000000,) | 0.5 | 3.677 ms | 0.205 ms |
(1000, 1000) | 0.1 | 2.963 ms | 0.214 ms |
(1000, 1000) | 0.5 | 7.505 ms | 0.222 ms |
(10000000,) | 0.1 | 15.239 ms | 0.750 ms |
(10000000,) | 0.5 | 48.883 ms | 0.798 ms |
(3162, 3163) | 0.1 | 30.753 ms | 0.802 ms |
(3162, 3163) | 0.5 | 96.284 ms | 0.818 ms |
(100000000,) | 0.1 | 174.880 ms | 5.406 ms |
(100000000,) | 0.5 | 488.649 ms | 5.329 ms |
(10000, 10000) | 0.1 | 342.146 ms | 5.852 ms |
(10000, 10000) | 0.5 | 961.539 ms | 5.976 ms |
CuPy CUB
shape | density | NumPy | CuPy |
---|---|---|---|
(1000000,) | 0.1 | 1.638 ms | 0.199 ms |
(1000000,) | 0.5 | 3.664 ms | 0.197 ms |
(1000, 1000) | 0.1 | 2.970 ms | 0.220 ms |
(1000, 1000) | 0.5 | 7.551 ms | 0.219 ms |
(10000000,) | 0.1 | 15.306 ms | 0.684 ms |
(10000000,) | 0.5 | 48.676 ms | 0.692 ms |
(3162, 3163) | 0.1 | 30.808 ms | 0.740 ms |
(3162, 3163) | 0.5 | 95.971 ms | 0.756 ms |
(100000000,) | 0.1 | 175.715 ms | 4.786 ms |
(100000000,) | 0.5 | 486.734 ms | 4.760 ms |
(10000, 10000) | 0.1 | 343.035 ms | 5.189 ms |
(10000, 10000) | 0.5 | 958.509 ms | 5.416 ms |
New scan
shape | density | NumPy | CuPy |
---|---|---|---|
(1000000,) | 0.1 | 1.489 ms | 0.204 ms |
(1000000,) | 0.5 | 3.661 ms | 0.244 ms |
(1000, 1000) | 0.1 | 3.018 ms | 0.221 ms |
(1000, 1000) | 0.5 | 7.525 ms | 0.224 ms |
(10000000,) | 0.1 | 15.071 ms | 0.699 ms |
(10000000,) | 0.5 | 48.683 ms | 0.705 ms |
(3162, 3163) | 0.1 | 30.551 ms | 0.763 ms |
(3162, 3163) | 0.5 | 96.009 ms | 0.778 ms |
(100000000,) | 0.1 | 174.189 ms | 4.941 ms |
(100000000,) | 0.5 | 486.482 ms | 4.997 ms |
(10000, 10000) | 0.1 | 340.625 ms | 5.340 ms |
(10000, 10000) | 0.5 | 958.917 ms | 5.513 ms |
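For reference, the benchmark input can be reproduced in plain NumPy. A minimal sketch, assuming "density" in the tables above means the fraction of nonzero elements (the GPU side is the same call on a CuPy array):

```python
import numpy as np

# Build a boolean array of the given shape whose fraction of True
# entries is approximately `density`, then locate the nonzeros.
rng = np.random.default_rng(0)
shape, density = (1000, 1000), 0.1
mask = rng.random(shape) < density
idx = np.argwhere(mask)  # one row of coordinates per nonzero
assert idx.shape[1] == len(shape)
```

With CuPy, replacing `np` by `cp` gives the timed operation; the tables report the GPU time of that `argwhere` call.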
Attempting to make the warp-based scan kernel (cupy#4315) work on ROCm/HIP. Changes are in https://github.com/leofang/cupy/tree/improve_scan_1. Only int32 and float32 have the required warp intrinsics. CUB is enabled by setting CUPY_ACCELERATORS=cub.
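Concretely, the CUB path is opt-in via an environment variable, so a run might look like this (the script name here is hypothetical):

```shell
# Opt in to CUB-backed routines for this run only
CUPY_ACCELERATORS=cub python bench_scan.py
```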
On CuPy master + ROCm 3.5.0 + Radeon VII:
size | dtype | NumPy | CuPy | CuPy (cupy#4315) | CuPy (CUB) | note |
---|---|---|---|---|---|---|
1000000 | int32 | 1.395 ms | 0.154 ms | 0.151 ms | 0.105 ms | HIP has warp support |
1000000 | int64 | 1.406 ms | 0.174 ms | 0.175 ms | 0.121 ms | |
1000000 | float32 | 2.187 ms | 0.155 ms | 0.148 ms | 0.107 ms | HIP has warp support |
1000000 | float64 | 2.189 ms | 0.177 ms | 0.177 ms | 0.121 ms | |
10000000 | int32 | 17.922 ms | 0.618 ms | 0.605 ms | 0.365 ms | HIP has warp support |
10000000 | int64 | 23.124 ms | 0.810 ms | 0.815 ms | 0.480 ms | |
10000000 | float32 | 25.759 ms | 0.621 ms | 0.608 ms | 0.368 ms | HIP has warp support |
10000000 | float64 | 30.172 ms | 0.812 ms | 0.818 ms | 0.485 ms | |
100000000 | int32 | 179.647 ms | 5.289 ms | 4.999 ms | 2.842 ms | HIP has warp support |
100000000 | int64 | 231.564 ms | 7.265 ms | 7.238 ms | 4.094 ms | |
100000000 | float32 | 257.582 ms | 5.305 ms | 5.004 ms | 2.843 ms | HIP has warp support; bad accuracy |
100000000 | float64 | 301.601 ms | 7.304 ms | 7.293 ms | 4.131 ms | |
Looks like hipCUB/rocPRIM wins unconditionally.
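For context on what the kernel computes: cupy#4315 implements an inclusive scan (prefix sum) with warp shuffle intrinsics in a doubling pattern. A minimal NumPy sketch of the same Hillis-Steele pattern, illustrative only (the real kernel performs each step with a shuffle-up across a warp):

```python
import numpy as np

def inclusive_scan(a):
    """Inclusive prefix sum via log2(n) doubling steps.

    Each iteration adds the value `offset` slots to the left, mirroring
    what a warp-level kernel does with shuffle-up intrinsics.
    """
    a = np.asarray(a).copy()
    offset = 1
    while offset < len(a):
        # The RHS is evaluated before assignment, so old values are read
        a[offset:] = a[offset:] + a[:-offset]
        offset *= 2
    return a

print(inclusive_scan([1, 2, 3, 4, 5]))  # matches np.cumsum
```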
Hi @maxpkatz, I hope you don't mind me bugging you again. Just one (I hope) quick question to clarify the behavior: when using cudaMemcpyAsync or cudaMemcpyPeerAsync to copy between two devices, on which device should the stream argument reside, the source or the destination? I found the documentation unclear about this, and after some tests I suspect it's the source. Am I right?
Thanks!
When backend='nvcc' is set, the compiler should first check whether nvcc is actually available. This is for an extreme corner case (cudatoolkit from conda-forge does not come with nvcc).

fftn can be done by looping over the multi-GPU fft I recently added (related: #8).

To learn: BlockReduce
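The fftn-by-looping idea can be sketched in plain NumPy. This is the single-device analogue; the multi-GPU version would substitute the distributed 1-D fft for np.fft.fft:

```python
import numpy as np

def fftn_by_axis_loop(a):
    # An n-D FFT is separable: applying a 1-D FFT along each axis in
    # turn reproduces np.fft.fftn. This decomposition is what lets a
    # (multi-GPU) 1-D fft implement fftn.
    out = np.asarray(a, dtype=complex)
    for axis in range(out.ndim):
        out = np.fft.fft(out, axis=axis)
    return out
```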
Tested at commit 6843030 of the cupy_reduce_cub_backend2 branch (with the block size locally changed to 1024).
Test script:
import sys
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
import cupy as cp
from cupyx.time import repeat
shape = (256, 512, 512)
a = cp.random.random(shape)
CUB_supported = ['sum', 'prod', 'min', 'max', 'argmin', 'argmax']
REST = ['amin', 'amax', 'nanmin', 'nanmax', 'nanargmin', 'nanargmax',
'mean', 'nanmean', 'var', 'nanvar', 'nansum', 'nanprod',
'all', 'any', 'count_nonzero']
for reduce_func in CUB_supported + REST:
    for axis in [(2,), (1, 2), (0, 1, 2)]:
        func = getattr(cp, reduce_func)
        print("testing axis = ", axis, '...')
        cp.cuda.cub_enabled = False
        cp.core.cub_block_reduction_enabled = False
        data = repeat(func, (a, axis), n=100)
        results = [data._to_str_per_item('GPU', data.gpu_times)]
        print('{:<10s} (old kernel):{}'.format(reduce_func, ' '.join(results)))
        b = func(a, axis)
        if reduce_func in CUB_supported:
            cp.cuda.cub_enabled = True
            cp.core.cub_block_reduction_enabled = False
            data = repeat(func, (a, axis), n=100)
            results = [data._to_str_per_item('GPU', data.gpu_times)]
            print('{:<10s} (CUB device):{}'.format(reduce_func, ' '.join(results)))
            c = func(a, axis)
        else:
            print('{:<10s} (CUB device):{}'.format(reduce_func, ' (CUB device-wide reduction not available)'))
        cp.cuda.cub_enabled = False
        cp.core.cub_block_reduction_enabled = True
        data = repeat(func, (a, axis), n=100)
        results = [data._to_str_per_item('GPU', data.gpu_times)]
        print('{:<10s} (CUB blocks):{}'.format(reduce_func, ' '.join(results)))
        d = func(a, axis)
        try:
            assert cp.allclose(b, d)
        except AssertionError:
            print("Result not match! (function: {}, axis: {})".format(reduce_func, axis), file=sys.stderr)
            raise
        finally:
            print()
    print('----------------------------------------------------------------------')
Result (P100 + CUDA 9.2); "CUB blocks" is the new implementation:
testing axis = (2,) ...
sum (old kernel): GPU: 4981.387 us +/-208.941 (min: 4831.776 / max: 5356.704) us
sum (CUB device): GPU: 1044.733 us +/- 1.323 (min: 1042.208 / max: 1052.096) us
sum (CUB blocks): GPU: 974.125 us +/-10.610 (min: 968.096 / max: 1039.008) us
testing axis = (1, 2) ...
sum (old kernel): GPU: 1234.895 us +/-51.552 (min: 1201.792 / max: 1346.144) us
sum (CUB device): GPU: 1032.296 us +/- 2.303 (min: 1028.320 / max: 1041.312) us
sum (CUB blocks): GPU: 1044.719 us +/- 2.730 (min: 1040.192 / max: 1052.864) us
testing axis = (0, 1, 2) ...
sum (old kernel): GPU:60725.747 us +/- 8.761 (min:60701.824 / max:60744.926) us
sum (CUB device): GPU: 997.767 us +/- 3.168 (min: 991.648 / max: 1013.632) us
sum (CUB blocks): GPU:46982.573 us +/- 8.591 (min:46961.792 / max:47007.584) us
----------------------------------------------------------------------
testing axis = (2,) ...
prod (old kernel): GPU: 4814.530 us +/- 2.763 (min: 4812.736 / max: 4829.632) us
prod (CUB device): GPU: 1010.767 us +/- 2.397 (min: 1008.352 / max: 1029.568) us
prod (CUB blocks): GPU: 953.065 us +/- 1.840 (min: 951.136 / max: 967.392) us
testing axis = (1, 2) ...
prod (old kernel): GPU: 1224.603 us +/-56.940 (min: 1182.944 / max: 1326.528) us
prod (CUB device): GPU: 997.257 us +/- 3.251 (min: 993.344 / max: 1016.352) us
prod (CUB blocks): GPU: 1027.258 us +/- 2.081 (min: 1024.032 / max: 1040.192) us
testing axis = (0, 1, 2) ...
prod (old kernel): GPU:60724.508 us +/- 9.653 (min:60708.256 / max:60786.015) us
prod (CUB device): GPU: 998.307 us +/- 2.376 (min: 994.720 / max: 1009.760) us
prod (CUB blocks): GPU:46981.408 us +/- 9.336 (min:46954.594 / max:47006.496) us
----------------------------------------------------------------------
testing axis = (2,) ...
min (old kernel): GPU: 7280.696 us +/- 3.647 (min: 7274.144 / max: 7300.736) us
min (CUB device): GPU: 1167.163 us +/- 3.714 (min: 1161.408 / max: 1187.968) us
min (CUB blocks): GPU: 1052.274 us +/- 1.791 (min: 1050.368 / max: 1066.528) us
testing axis = (1, 2) ...
min (old kernel): GPU: 1542.100 us +/- 4.306 (min: 1537.056 / max: 1574.688) us
min (CUB device): GPU: 1002.540 us +/- 4.994 (min: 997.568 / max: 1035.200) us
min (CUB blocks): GPU: 1536.580 us +/- 2.408 (min: 1531.008 / max: 1551.200) us
testing axis = (0, 1, 2) ...
min (old kernel): GPU:68409.857 us +/-28.736 (min:68342.529 / max:68475.166) us
min (CUB device): GPU: 1135.014 us +/- 3.722 (min: 1129.376 / max: 1153.120) us
min (CUB blocks): GPU:100779.564 us +/-49.208 (min:100715.485 / max:100850.014) us
----------------------------------------------------------------------
testing axis = (2,) ...
max (old kernel): GPU: 7279.699 us +/- 3.361 (min: 7273.632 / max: 7298.848) us
max (CUB device): GPU: 1167.538 us +/- 5.012 (min: 1160.928 / max: 1193.696) us
max (CUB blocks): GPU: 1052.600 us +/- 1.783 (min: 1050.784 / max: 1065.824) us
testing axis = (1, 2) ...
max (old kernel): GPU: 1541.445 us +/- 2.969 (min: 1536.320 / max: 1554.592) us
max (CUB device): GPU: 1002.808 us +/- 5.097 (min: 997.920 / max: 1024.800) us
max (CUB blocks): GPU: 1538.108 us +/- 3.368 (min: 1532.960 / max: 1557.952) us
testing axis = (0, 1, 2) ...
max (old kernel): GPU:68411.174 us +/-26.093 (min:68351.715 / max:68464.447) us
max (CUB device): GPU: 1135.247 us +/- 3.801 (min: 1128.288 / max: 1152.608) us
max (CUB blocks): GPU:101066.898 us +/-24.490 (min:101019.295 / max:101112.221) us
----------------------------------------------------------------------
testing axis = (2,) ...
argmin (old kernel): GPU: 7284.797 us +/- 2.291 (min: 7280.800 / max: 7292.480) us
argmin (CUB device): GPU: 7291.742 us +/- 2.886 (min: 7286.016 / max: 7306.240) us
argmin (CUB blocks): GPU: 1901.183 us +/- 4.003 (min: 1898.272 / max: 1919.168) us
testing axis = (1, 2) ...
argmin (old kernel): GPU: 1665.333 us +/- 2.907 (min: 1659.968 / max: 1676.032) us
argmin (CUB device): GPU: 1672.449 us +/- 4.058 (min: 1664.736 / max: 1693.280) us
argmin (CUB blocks): GPU: 3336.875 us +/- 5.334 (min: 3329.856 / max: 3371.680) us
testing axis = (0, 1, 2) ...
argmin (old kernel): GPU:80236.562 us +/-16.019 (min:80201.378 / max:80276.733) us
argmin (CUB device): GPU: 977.714 us +/- 2.309 (min: 973.472 / max: 992.192) us
argmin (CUB blocks): GPU:175400.849 us +/-16.974 (min:175360.123 / max:175450.790) us
----------------------------------------------------------------------
testing axis = (2,) ...
argmax (old kernel): GPU: 7284.120 us +/- 2.183 (min: 7279.680 / max: 7290.464) us
argmax (CUB device): GPU: 7291.568 us +/- 2.834 (min: 7286.784 / max: 7306.656) us
argmax (CUB blocks): GPU: 1900.549 us +/- 2.215 (min: 1897.888 / max: 1914.624) us
testing axis = (1, 2) ...
argmax (old kernel): GPU: 1665.754 us +/- 3.971 (min: 1660.224 / max: 1685.280) us
argmax (CUB device): GPU: 1672.875 us +/- 3.158 (min: 1665.984 / max: 1681.184) us
argmax (CUB blocks): GPU: 3333.666 us +/- 5.844 (min: 3324.992 / max: 3363.712) us
testing axis = (0, 1, 2) ...
argmax (old kernel): GPU:80241.476 us +/-20.790 (min:80196.579 / max:80329.826) us
argmax (CUB device): GPU: 978.660 us +/- 1.914 (min: 975.168 / max: 987.360) us
argmax (CUB blocks): GPU:174229.262 us +/-13.119 (min:174202.713 / max:174262.527) us
----------------------------------------------------------------------
testing axis = (2,) ...
amin (old kernel): GPU: 7278.928 us +/- 2.520 (min: 7274.656 / max: 7286.016) us
amin (CUB device): (CUB device-wide reduction not available)
amin (CUB blocks): GPU: 1052.410 us +/- 1.762 (min: 1050.688 / max: 1066.400) us
testing axis = (1, 2) ...
amin (old kernel): GPU: 1541.058 us +/- 2.964 (min: 1537.760 / max: 1558.752) us
amin (CUB device): (CUB device-wide reduction not available)
amin (CUB blocks): GPU: 1536.903 us +/- 2.836 (min: 1532.384 / max: 1549.696) us
testing axis = (0, 1, 2) ...
amin (old kernel): GPU:68411.164 us +/-30.626 (min:68315.712 / max:68490.082) us
amin (CUB device): (CUB device-wide reduction not available)
amin (CUB blocks): GPU:100777.386 us +/-50.680 (min:100704.124 / max:100849.632) us
----------------------------------------------------------------------
testing axis = (2,) ...
amax (old kernel): GPU: 7279.396 us +/- 3.273 (min: 7272.288 / max: 7294.400) us
amax (CUB device): (CUB device-wide reduction not available)
amax (CUB blocks): GPU: 1052.501 us +/- 1.848 (min: 1050.272 / max: 1065.088) us
testing axis = (1, 2) ...
amax (old kernel): GPU: 1541.122 us +/- 3.032 (min: 1535.328 / max: 1555.648) us
amax (CUB device): (CUB device-wide reduction not available)
amax (CUB blocks): GPU: 1538.018 us +/- 3.088 (min: 1534.112 / max: 1552.320) us
testing axis = (0, 1, 2) ...
amax (old kernel): GPU:68409.406 us +/-29.267 (min:68310.623 / max:68465.630) us
amax (CUB device): (CUB device-wide reduction not available)
amax (CUB blocks): GPU:101066.076 us +/-23.849 (min:101021.507 / max:101114.883) us
----------------------------------------------------------------------
testing axis = (2,) ...
nanmin (old kernel): GPU: 6995.210 us +/- 4.527 (min: 6985.952 / max: 7012.544) us
nanmin (CUB device): (CUB device-wide reduction not available)
nanmin (CUB blocks): GPU: 1076.651 us +/-13.292 (min: 1055.840 / max: 1118.304) us
testing axis = (1, 2) ...
nanmin (old kernel): GPU: 1541.821 us +/-28.587 (min: 1446.752 / max: 1583.264) us
nanmin (CUB device): (CUB device-wide reduction not available)
nanmin (CUB blocks): GPU: 1272.550 us +/-39.880 (min: 1203.328 / max: 1315.168) us
testing axis = (0, 1, 2) ...
nanmin (old kernel): GPU:62372.098 us +/- 7.051 (min:62359.039 / max:62395.809) us
nanmin (CUB device): (CUB device-wide reduction not available)
nanmin (CUB blocks): GPU:73762.741 us +/- 6.750 (min:73742.653 / max:73780.479) us
----------------------------------------------------------------------
testing axis = (2,) ...
nanmax (old kernel): GPU: 6994.817 us +/- 6.036 (min: 6985.376 / max: 7023.648) us
nanmax (CUB device): (CUB device-wide reduction not available)
nanmax (CUB blocks): GPU: 1049.083 us +/- 5.312 (min: 1040.704 / max: 1071.232) us
testing axis = (1, 2) ...
nanmax (old kernel): GPU: 1545.598 us +/-19.431 (min: 1447.776 / max: 1568.640) us
nanmax (CUB device): (CUB device-wide reduction not available)
nanmax (CUB blocks): GPU: 1299.933 us +/-45.486 (min: 1221.440 / max: 1371.712) us
testing axis = (0, 1, 2) ...
nanmax (old kernel): GPU:62374.201 us +/- 8.502 (min:62355.873 / max:62394.623) us
nanmax (CUB device): (CUB device-wide reduction not available)
nanmax (CUB blocks): GPU:73761.331 us +/- 5.107 (min:73749.283 / max:73777.054) us
----------------------------------------------------------------------
testing axis = (2,) ...
nanargmin (old kernel): GPU: 7521.018 us +/- 8.955 (min: 7497.728 / max: 7550.592) us
nanargmin (CUB device): (CUB device-wide reduction not available)
nanargmin (CUB blocks): GPU: 1973.085 us +/-12.519 (min: 1956.928 / max: 2008.640) us
testing axis = (1, 2) ...
nanargmin (old kernel): GPU: 1849.730 us +/- 3.769 (min: 1842.944 / max: 1871.520) us
nanargmin (CUB device): (CUB device-wide reduction not available)
nanargmin (CUB blocks): GPU: 3707.225 us +/-17.858 (min: 3687.136 / max: 3776.480) us
testing axis = (0, 1, 2) ...
nanargmin (old kernel): GPU:81803.034 us +/-14.609 (min:81770.882 / max:81840.797) us
nanargmin (CUB device): (CUB device-wide reduction not available)
nanargmin (CUB blocks): GPU:190295.268 us +/-18.013 (min:190263.840 / max:190389.053) us
----------------------------------------------------------------------
testing axis = (2,) ...
nanargmax (old kernel): GPU: 7520.069 us +/- 8.667 (min: 7498.752 / max: 7543.712) us
nanargmax (CUB device): (CUB device-wide reduction not available)
nanargmax (CUB blocks): GPU: 1971.927 us +/-11.871 (min: 1956.960 / max: 2012.192) us
testing axis = (1, 2) ...
nanargmax (old kernel): GPU: 1856.170 us +/- 3.025 (min: 1851.296 / max: 1871.520) us
nanargmax (CUB device): (CUB device-wide reduction not available)
nanargmax (CUB blocks): GPU: 3695.171 us +/-12.590 (min: 3679.200 / max: 3732.896) us
testing axis = (0, 1, 2) ...
nanargmax (old kernel): GPU:81804.032 us +/-15.481 (min:81766.975 / max:81843.262) us
nanargmax (CUB device): (CUB device-wide reduction not available)
nanargmax (CUB blocks): GPU:190926.042 us +/-12.581 (min:190897.018 / max:190953.796) us
----------------------------------------------------------------------
testing axis = (2,) ...
mean (old kernel): GPU: 4919.463 us +/- 2.347 (min: 4917.248 / max: 4932.576) us
mean (CUB device): (CUB device-wide reduction not available)
mean (CUB blocks): GPU: 984.885 us +/-15.563 (min: 957.888 / max: 1039.520) us
testing axis = (1, 2) ...
mean (old kernel): GPU: 1250.211 us +/-61.111 (min: 1193.696 / max: 1351.392) us
mean (CUB device): (CUB device-wide reduction not available)
mean (CUB blocks): GPU: 1069.789 us +/-17.287 (min: 1035.040 / max: 1138.208) us
testing axis = (0, 1, 2) ...
mean (old kernel): GPU:60708.971 us +/-10.122 (min:60692.577 / max:60769.440) us
mean (CUB device): (CUB device-wide reduction not available)
mean (CUB blocks): GPU:46983.943 us +/-10.468 (min:46957.314 / max:47013.695) us
----------------------------------------------------------------------
testing axis = (2,) ...
nanmean (old kernel): GPU: 5708.619 us +/- 2.875 (min: 5705.440 / max: 5726.112) us
nanmean (CUB device): (CUB device-wide reduction not available)
nanmean (CUB blocks): GPU: 992.895 us +/-14.937 (min: 959.744 / max: 1035.840) us
testing axis = (1, 2) ...
nanmean (old kernel): GPU: 1553.189 us +/- 4.186 (min: 1545.536 / max: 1566.496) us
nanmean (CUB device): (CUB device-wide reduction not available)
nanmean (CUB blocks): GPU: 1158.552 us +/-15.002 (min: 1134.752 / max: 1205.888) us
testing axis = (0, 1, 2) ...
nanmean (old kernel): GPU:66023.918 us +/-35.811 (min:65941.635 / max:66112.961) us
nanmean (CUB device): (CUB device-wide reduction not available)
nanmean (CUB blocks): GPU:49554.622 us +/- 7.062 (min:49534.206 / max:49575.775) us
----------------------------------------------------------------------
testing axis = (2,) ...
var (old kernel): GPU: 6640.631 us +/- 7.249 (min: 6623.936 / max: 6664.608) us
var (CUB device): (CUB device-wide reduction not available)
var (CUB blocks): GPU: 2680.171 us +/- 4.932 (min: 2669.120 / max: 2692.480) us
testing axis = (1, 2) ...
var (old kernel): GPU:15180.429 us +/-141.299 (min:14959.104 / max:15639.168) us
var (CUB device): (CUB device-wide reduction not available)
var (CUB blocks): GPU:14967.292 us +/-144.692 (min:14789.376 / max:15738.304) us
testing axis = (0, 1, 2) ...
var (old kernel): GPU:1871272.574 us +/-304.004 (min:1870493.774 / max:1871963.867) us
var (CUB device): (CUB device-wide reduction not available)
var (CUB blocks): GPU:114163.430 us +/-17.509 (min:114111.069 / max:114227.264) us
----------------------------------------------------------------------
testing axis = (2,) ...
nanvar (old kernel): GPU:11511.579 us +/- 2.125 (min:11508.064 / max:11520.352) us
nanvar (CUB device): (CUB device-wide reduction not available)
nanvar (CUB blocks): GPU: 4742.906 us +/- 2.971 (min: 4738.720 / max: 4759.840) us
testing axis = (1, 2) ...
nanvar (old kernel): GPU:27261.787 us +/-14.532 (min:27224.735 / max:27303.360) us
nanvar (CUB device): (CUB device-wide reduction not available)
nanvar (CUB blocks): GPU:26852.691 us +/-16.575 (min:26819.040 / max:26892.769) us
testing axis = (0, 1, 2) ...
nanvar (old kernel): GPU:4169472.632 us +/-66.999 (min:4169318.848 / max:4169709.961) us
nanvar (CUB device): (CUB device-wide reduction not available)
nanvar (CUB blocks): GPU:278711.689 us +/-40.308 (min:278618.988 / max:278808.868) us
----------------------------------------------------------------------
testing axis = (2,) ...
nansum (old kernel): GPU: 4344.424 us +/- 2.046 (min: 4342.048 / max: 4357.792) us
nansum (CUB device): (CUB device-wide reduction not available)
nansum (CUB blocks): GPU: 953.723 us +/- 1.024 (min: 952.000 / max: 958.880) us
testing axis = (1, 2) ...
nansum (old kernel): GPU: 1246.913 us +/- 2.604 (min: 1242.432 / max: 1259.712) us
nansum (CUB device): (CUB device-wide reduction not available)
nansum (CUB blocks): GPU: 1053.856 us +/- 3.706 (min: 1049.184 / max: 1069.664) us
testing axis = (0, 1, 2) ...
nansum (old kernel): GPU:63447.723 us +/- 6.865 (min:63433.632 / max:63474.049) us
nansum (CUB device): (CUB device-wide reduction not available)
nansum (CUB blocks): GPU:47387.075 us +/- 2.659 (min:47384.960 / max:47401.279) us
----------------------------------------------------------------------
testing axis = (2,) ...
nanprod (old kernel): GPU: 4345.197 us +/- 2.536 (min: 4342.880 / max: 4358.656) us
nanprod (CUB device): (CUB device-wide reduction not available)
nanprod (CUB blocks): GPU: 993.034 us +/-17.009 (min: 961.696 / max: 1055.584) us
testing axis = (1, 2) ...
nanprod (old kernel): GPU: 1252.908 us +/- 5.268 (min: 1245.056 / max: 1266.624) us
nanprod (CUB device): (CUB device-wide reduction not available)
nanprod (CUB blocks): GPU: 1088.906 us +/-14.730 (min: 1069.088 / max: 1148.384) us
testing axis = (0, 1, 2) ...
nanprod (old kernel): GPU:63449.758 us +/- 6.428 (min:63435.360 / max:63466.496) us
nanprod (CUB device): (CUB device-wide reduction not available)
nanprod (CUB blocks): GPU:47388.084 us +/- 8.357 (min:47383.457 / max:47465.439) us
----------------------------------------------------------------------
testing axis = (2,) ...
all (old kernel): GPU: 4558.527 us +/- 8.207 (min: 4552.864 / max: 4594.304) us
all (CUB device): (CUB device-wide reduction not available)
all (CUB blocks): GPU: 987.198 us +/-16.190 (min: 965.152 / max: 1054.752) us
testing axis = (1, 2) ...
all (old kernel): GPU: 1218.613 us +/-14.359 (min: 1210.784 / max: 1357.472) us
all (CUB device): (CUB device-wide reduction not available)
all (CUB blocks): GPU: 1054.616 us +/-15.980 (min: 1030.752 / max: 1114.624) us
testing axis = (0, 1, 2) ...
all (old kernel): GPU:63045.030 us +/- 3.270 (min:63037.983 / max:63056.030) us
all (CUB device): (CUB device-wide reduction not available)
all (CUB blocks): GPU:44248.783 us +/- 8.405 (min:44229.759 / max:44272.224) us
----------------------------------------------------------------------
testing axis = (2,) ...
any (old kernel): GPU: 4551.490 us +/- 2.726 (min: 4548.736 / max: 4565.888) us
any (CUB device): (CUB device-wide reduction not available)
any (CUB blocks): GPU: 980.970 us +/-14.647 (min: 962.368 / max: 1031.104) us
testing axis = (1, 2) ...
any (old kernel): GPU: 1227.028 us +/- 6.295 (min: 1216.128 / max: 1253.536) us
any (CUB device): (CUB device-wide reduction not available)
any (CUB blocks): GPU: 1064.959 us +/-13.211 (min: 1047.040 / max: 1118.752) us
testing axis = (0, 1, 2) ...
any (old kernel): GPU:63815.470 us +/-11.096 (min:63790.302 / max:63843.521) us
any (CUB device): (CUB device-wide reduction not available)
any (CUB blocks): GPU:48141.098 us +/- 6.768 (min:48132.000 / max:48183.426) us
----------------------------------------------------------------------
testing axis = (2,) ...
count_nonzero (old kernel): GPU: 4333.822 us +/- 7.074 (min: 4328.640 / max: 4365.376) us
count_nonzero (CUB device): (CUB device-wide reduction not available)
count_nonzero (CUB blocks): GPU: 977.654 us +/-11.242 (min: 962.656 / max: 1034.304) us
testing axis = (1, 2) ...
count_nonzero (old kernel): GPU: 1238.971 us +/- 4.179 (min: 1232.992 / max: 1253.280) us
count_nonzero (CUB device): (CUB device-wide reduction not available)
count_nonzero (CUB blocks): GPU: 1060.503 us +/-13.000 (min: 1043.104 / max: 1112.512) us
testing axis = (0, 1, 2) ...
count_nonzero (old kernel): GPU:64118.497 us +/-20.636 (min:64062.462 / max:64168.701) us
count_nonzero (CUB device): (CUB device-wide reduction not available)
count_nonzero (CUB blocks): GPU:47377.478 us +/- 3.837 (min:47369.022 / max:47395.615) us
----------------------------------------------------------------------