Comments (4)
@al42and
Thank you very much for the feedback. This is something we are aware of as a potentially beneficial feature for users who are interested in improved performance at the cost of some convenience. Your post very clearly lays out this demand.
My expectation is that if oneDPL were to support something like this, it would be in the context of kernel template APIs. That fits the mindset behind kernel templates, which is to give the user more control in pursuit of better performance, at the cost of some generality. We are considering this feature as well as others to prioritize performance within that effort.
from onedpl.
Hi @danhoeflinger,
> My expectation is that if oneDPL were to support something like this, it would be in the context of kernel template APIs.

If you have a clear idea of what the API will look like, could you elaborate further, please? I can't understand how a runtime pointer can be passed this way. Or are you suggesting a flag along the lines of "persist the working buffer past the launch and reuse the old one if it exists"?
> We are considering this feature as well as others to prioritize performance within that effort.

To be clear: this is not a priority for us (GROMACS). So far, we are operating on a fixed-size array of 8k elements, so it's a single kernel launch without any working buffers. But that size was set arbitrarily, so this problem could become pressing if we go past 16k, which is why I decided to raise the issue proactively.
@al42and
I don't have specific information at this time about what the API would look like, but reuse of temporary memory allocations is something we are considering. The "kernel template" APIs are not tied to the C++ standard library's parallel algorithms specification, and we envision API adjustments to support functionality like this. There are multiple possible approaches to supporting this feature. One option is to add an extra API to query the temporary space required, plus extra runtime parameter(s) to accept externally allocated memory.
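To make the last option concrete, here is a minimal host-only sketch of the "query, then pass in the buffer" pattern. It is not oneDPL code and not a proposed oneDPL API: the names `scan_temp_storage_bytes` and `inclusive_scan_with_temp`, the `block` parameter, and the blocked three-pass scan are all hypothetical and only illustrate how the caller could own and reuse the scratch allocation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical query function (not part of oneDPL): reports how many bytes
// of scratch space a scan over n elements needs, one partial sum per block.
std::size_t scan_temp_storage_bytes(std::size_t n, std::size_t block = 4) {
    const std::size_t blocks = (n + block - 1) / block;
    return blocks * sizeof(int);
}

// Hypothetical algorithm entry point (not part of oneDPL): an inclusive scan
// that performs no allocation and uses caller-owned scratch space instead.
// `temp` must be suitably aligned for int.
void inclusive_scan_with_temp(const int* in, int* out, std::size_t n,
                              void* temp, std::size_t temp_bytes,
                              std::size_t block = 4) {
    // The caller must supply at least as much space as the query reported.
    assert(temp_bytes >= scan_temp_storage_bytes(n, block));
    int* block_sums = static_cast<int*>(temp);
    const std::size_t blocks = (n + block - 1) / block;

    // Pass 1: scan each block independently and record its total.
    for (std::size_t b = 0; b < blocks; ++b) {
        const std::size_t end = std::min(n, (b + 1) * block);
        int running = 0;
        for (std::size_t i = b * block; i < end; ++i) {
            running += in[i];
            out[i] = running;
        }
        block_sums[b] = running;
    }

    // Pass 2: turn the block totals into exclusive offsets.
    int carry = 0;
    for (std::size_t b = 0; b < blocks; ++b) {
        const int total = block_sums[b];
        block_sums[b] = carry;
        carry += total;
    }

    // Pass 3: add each block's offset to its local results.
    for (std::size_t b = 0; b < blocks; ++b) {
        const std::size_t end = std::min(n, (b + 1) * block);
        for (std::size_t i = b * block; i < end; ++i) {
            out[i] += block_sums[b];
        }
    }
}
```

With an interface shaped like this, a caller would query the size once for the largest input it expects, allocate a single buffer, and pass the same pointer to every subsequent call, so repeated scans pay no allocation cost. In a device setting the buffer would presumably be a USM allocation rather than host memory, but that is an assumption about a design that does not exist yet.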
Separate from a potential oneDPL feature, there are ways to mitigate the performance penalty from these repeated temporary allocations that are currently available in the oneAPI DPC++ compiler. I suggest taking a look at the SYCL_PI_LEVEL_ZERO_USM_ALLOCATOR environment variable:
https://intel.github.io/llvm-docs/EnvironmentVariables.html#debugging-variables-for-level-zero-plugin
This environment variable configures the memory pool sizes used by the Level Zero USM allocator. For repeated oneDPL calls of the same size, this can help reduce the performance impact of the temporary allocation by reusing allocations from a memory pool. If you need help selecting values for this environment variable, we can work with you on your specific use case, but some experimentation may be necessary.
When we have more information, we will update here. Thanks again for the feedback.
> I suggest taking a look at the SYCL_PI_LEVEL_ZERO_USM_ALLOCATOR environment variable:
> intel.github.io/llvm-docs/EnvironmentVariables.html#debugging-variables-for-level-zero-plugin

Unfortunately, that will not work for other backends, and even for L0 it is not a user-friendly solution.
But thanks for suggesting it as a workaround. It could definitely be helpful during development.