Comments (2)
Hi!
A couple of pointers:
- The SYCL specification does not support non-scalar reductions, e.g. you cannot reduce a 3D array along one axis to a 2D array. This is independent of `span<>` support. For such a numpy-like operation you would need some higher-level library (I don't know whether one exists), as this is beyond the scope of the language, which, after all, focuses more on providing lower-level primitives to aid in the construction of such high-level libraries.
- What you can do is have multiple reduction objects in one kernel, e.g. `q.parallel_for(range, sycl::reduction(/* specify 1st reduction */), sycl::reduction(/* specify 2nd reduction */), /* ...more reductions if desired... */, kernel);`. This is an appropriate solution if you know the number of reductions at compile time.
- That the reduction result ends up at the first element of the buffer regardless of the offset might be a bug; however, this API is non-standard anyway, as the SYCL 2020 final specification has changed the API to work in terms of `buffer` directly, not `accessor`, and we have not yet updated the API. Of course, when working with `buffer` you won't be able to specify this anyway.
- To solve this, I recommend using USM pointers instead of the buffer-accessor API, as this will give you much more control over where reduction results end up (you can just pass in an arbitrary pointer with an arbitrary offset). Also, the USM memory model is generally more efficient than buffer-accessor, as its overheads are lower.
- In general, be aware that the `sycl::reduction` support built into the language is intended as a building block, not necessarily a performance-portable algorithm solution. For example, `sycl::reduction` does not handle how many data elements from the input array should be reduced by a single work item (i.e. how many calls to `combine()` you have per work item). Depending on the hardware, this can be an extremely important tuning parameter!
- Also, be aware that our `sycl::reduction` support is incomplete. As I have mentioned, it does not yet implement the SYCL 2020 final API, and it is also not supported on all compilation flows.
- Our more high-level algorithm `hipsycl::algorithms::transform_reduce()` and its cousin from our parallel STL offloading support, `std::transform_reduce(std::execution::par_unseq)`, are universally supported across all compilation flows and already handle the most important tuning parameters for you, so they generally perform better, at least out of the box. However, the `transform_reduce()` API only supports a single reduction.
from adaptivecpp.
Hi illuhad,
thanks for your reply, that is very helpful to me! I will investigate the use of USM instead of buffers / accessors and also have a look at the transform_reduce high-level algorithm that you mentioned.
Concerning the algorithmic problem, the issue is that I do not know the number of reductions at compile time, but that these depend on intermediate results. Probably that means that I am best off by enqueueing each reduction separately, right?
Best regards!