Comments (2)
Hi!
A couple of pointers:
- The SYCL specification does not support non-scalar reductions, e.g. you cannot reduce a 3D array along one axis to a 2D array. This is independent of `span<>` support. For such a numpy-like operation you would need some higher-level library (I don't know whether one exists), as this is beyond the scope of the language, which, after all, focuses more on providing lower-level primitives to aid in the construction of such high-level libraries.
- What you can do is have multiple reduction objects in one kernel, e.g. `q.parallel_for(range, sycl::reduction(/* specify 1st reduction */), sycl::reduction(/* specify 2nd reduction */), /* ...more reductions if desired... */, kernel);`. This is an appropriate solution if you know the number of reductions at compile time.
- That the reduction result ends up at the first element of the buffer regardless of the offset might be a bug; however, this API is non-standard anyway, as the SYCL 2020 final specification has changed the API to work in terms of `buffer` directly, not `accessor`, and we have not yet updated the API. Of course, when working with `buffer` you won't be able to specify this anyway.
- To solve this, I recommend using USM pointers instead of the buffer-accessor API, as this will give you much more control over where reduction results end up (you can just pass in an arbitrary pointer with an arbitrary offset). Also, the USM memory model is generally more efficient than buffer-accessor, as its overheads are lower.
- In general, be aware that the `sycl::reduction` support built into the language is intended as a building block, not necessarily a performance-portable algorithm solution. For example, `sycl::reduction` does not handle how many data elements from the input array should be reduced by a single work item (i.e. how many calls to `combine()` you have per work item). Depending on the hardware, this can be an extremely important tuning parameter!
- Also, be aware that our `sycl::reduction` support is incomplete. As I have mentioned, it does not yet implement the SYCL 2020 final API, and it is also not supported on all compilation flows.
- Our more high-level algorithm `hipsycl::algorithms::transform_reduce()` and its cousin from our parallel STL offloading support, `std::transform_reduce(std::execution::par_unseq)`, are universally supported across all compilation flows and already handle the most important tuning parameters for you, so they generally perform better, at least out of the box. However, the `transform_reduce()` API only supports a single reduction.
from adaptivecpp.
Hi illuhad,
thanks for your reply, that is very helpful to me! I will investigate the use of USM instead of buffers / accessors and also have a look at the transform_reduce high-level algorithm that you mentioned.
Concerning the algorithmic problem, the issue is that I do not know the number of reductions at compile time, but that these depend on intermediate results. Probably that means that I am best off by enqueueing each reduction separately, right?
Best regards!