Comments (5)
It works by running the same sets of pipeline waves for each device, which also limits load-balancing granularity when enabled. I can't log in to Windows right now; it's bugged. But you can clone the project and add an if/else statement near the pipelining part to check whether the device is an integrated GPU or a CPU, though this is a bit complex because of my writing style (sorry for that; I was learning C# while writing this :) )
Another easy way is to decrease the number of pipeline waves (2 instead of 4, for example) and clone the discrete device object twice. This would allocate more threads to the discrete GPU.
Maybe even without pipelining: just clone the discrete GPU 3-4 times (without cloning the CPU) in the explicit device-selection part. Having 3-4 threads per GPU should be enough, but I'm not sure which workloads benefit from this.
(assuming your project only needs an embarrassingly parallel action)
If you want GPUs to issue commands independently as soon as possible, the device pool and task pool
https://github.com/tugrul512bit/Cekirdekler/wiki/Device-Pool-and-Task-Pool
is nearly as easy as using pipelineEnabled=true. You will need to add the "to be pipelined" devices multiple times into it: 2 entries for a discrete GPU, 1 for an integrated GPU, 1 for the CPU, for example. Some expensive cards can even overlap 3-4 queues, so you can even have 3-4 clones of a discrete GPU to hide PCIe latencies.
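Using only the device-selection and `+` device-addition calls quoted elsewhere in this thread, building such a list of pooled devices might look like the sketch below. The platform/vendor picks and the `cpus(...)` arguments are illustrative assumptions (the second parameter mirrors the `streamOnOff` flag from the core-count example later in the thread); the pool construction itself is documented on the linked wiki page.

```csharp
// Sketch: select the same discrete GPU twice so it contributes two entries
// (two concurrent queues), plus one iGPU entry and one CPU entry.
Hardware.ClPlatforms platforms = Hardware.ClPlatforms.all();

var discrete1 = platforms.platformsAmd().gpus().devicesWithMostComputeUnits()[0];
var discrete2 = platforms.platformsAmd().gpus().devicesWithMostComputeUnits()[0];
var igpu = platforms.platformsIntel().gpus().devicesWithMostComputeUnits()[0];
var cpu = platforms.platformsIntel().cpus(true, false, 4); // flags/core count are guesses

// devices combine with '+'; the discrete GPU appears twice in the result
var pooled = discrete1 + discrete2 + igpu + cpu;
```

The doubled discrete entry is what lets the pool keep two commands in flight on that card at once.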
You can even use constraints in the task pool with proper masks like
TASK_MESSAGE_BROADCAST
TASK_MESSAGE_GLOBAL_SYNCHRONIZATION_FIRST
and similar.
DEVICE_COMPUTE_AT_WILL
will make any device compute any task in the pool as soon as possible. This somewhat depends on CPU performance too if there are too many devices. Its throughput should be very close to that of a pipeline, maybe even higher, because there is no synchronization between devices here. You just need to divide the work into tasks such as
compute 0 to 1023
compute 1024 to 2047
compute 2048 to ....
maybe within a preparation loop.
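Such a preparation loop can be sketched in plain C#; the call that would actually submit each range to the task pool is left as a comment, since only the mask names above are quoted in this thread and the submission API is on the linked wiki page.

```csharp
using System;
using System.Collections.Generic;

const int totalItems = 1024 * 1024; // total work items, illustrative
const int chunk = 1024;             // items per task

var ranges = new List<(int start, int end)>();
for (int start = 0; start < totalItems; start += chunk)
{
    int end = Math.Min(start + chunk, totalItems) - 1;
    ranges.Add((start, end)); // (0,1023), (1024,2047), (2048,3071), ...
}
// each (start, end) pair would then become one task added to the task pool
```

Smaller chunks give the load balancer more to share between devices; bigger chunks mean less per-task overhead, so the chunk size is worth tuning.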
Example of cloning a device:
// selects the first Intel GPU twice and combines the two references
// into 2 virtual GPUs in the devices3 variable
var devices = platforms.platformsIntel().gpus().devicesWithMostComputeUnits()[0];
var devices2 = platforms.platformsIntel().gpus().devicesWithMostComputeUnits()[0];
var devices3 = devices + devices2; // any number of devices can be added like this, not just one
Another advantage of the device/task pool is performance-aware computing: if one device gets stuck for a while, the others can quickly finish the rest of the job. It probably performs better than pipelining, but with slightly more complex code, of course. You may need to balance the number of tasks, probably with your own dynamic code, to get both enough work sharing and low latency.
from cekirdekler.
Edit: this is obsolete; just continue to the next message.
For queue-backed features such as the device/task pool, C#'s Parallel.For performance matters on the host side, so using a newer .NET version could make it faster.
I just noticed I've already upgraded from Parallel.For to dedicated threads in the device pool.
So you get a boost only on
- device-to-device pipeline
- simple compute
It's been a long time since I looked at the code; really sorry.
You should test the simplest case first: multiple discrete GPU clones + 1 CPU + 1 iGPU in the same device object, without pipelining enabled, so it acts like pipelining for the cloned device.
Also, it can take nearly 50 steps to converge to a fair load balance for some workloads and systems.
The iterative load-balancer solver is meant to be more performant than precise :)
Also, if your CPU doesn't have enough cores, you can decrease the number of cores used in "compute" so that 1-2 cores stay free to feed the GPUs.
To select 5 cores for example:
Hardware.ClPlatforms platforms = Hardware.ClPlatforms.all();
platforms.platformsAmd().cpus(true, streamOnOff, 5); // third parameter = number of cores to use
It will try to select 5 cores if available; if not, it picks N-1 by default (or at least that's how I remember it).
I developed this with an FX-8150 + HD 7870 and an N3060 + HD440 (or some other low-end Intel iGPU), and it did help keep the GPUs more in-flight.
When all CPU cores are busy computing a kernel, it really limits the capabilities of every feature, especially when RAM bandwidth is low.
Additionally,
platforms.devicesWithDedicatedMemory()
can be useful for knowing which devices have dedicated memory when cloning them. Devices with dedicated memory most probably also have 1 or 2 DMA engines to copy data to/from RAM. I didn't test this for all dGPUs and special APUs (with DRAM inside the CPU), but I expect it to hold.
Get the dedicated-memory devices, clone them once or twice, and add them onto the full device list so that dedicated-memory devices get two or three copies while the others get only one.
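Put together, and using only the calls quoted in this thread, that cloning scheme might be sketched as follows. It assumes the `platforms` object from the earlier core-selection example; which devices actually exist is system-dependent, so treat the names as illustrative.

```csharp
// Devices owning dedicated memory (likely with 1-2 DMA engines):
// select the group twice so each such device gets two entries.
var dedicated1 = platforms.devicesWithDedicatedMemory();
var dedicated2 = platforms.devicesWithDedicatedMemory();

// remaining devices (e.g. an iGPU) are added only once
var igpu = platforms.platformsIntel().gpus().devicesWithMostComputeUnits()[0];

// dedicated-memory devices end up with two copies, the others with one
var all = dedicated1 + dedicated2 + igpu;
```

The idea is that the extra copies let the DMA engines overlap transfers with compute, while shared-memory devices gain nothing from cloning.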