Comments (5)
It works by running the same sets of pipeline waves for each device, which also limits load-balancing granularity when enabled. I can't log in to Windows right now; it's bugged. But you can clone the project and add an if/else statement near the pipelining part to check whether the device is an integrated GPU or a CPU, though this is a bit complex because of my writing style (sorry for that; I was learning C# while writing this :) )
Another easy way is to decrease the number of pipeline waves (2 instead of 4, for example) and clone the discrete device object twice. This would allocate more threads to the discrete GPU.
Maybe even without pipelining: just clone the discrete GPU 3-4 times (without cloning the CPU) in the explicit device-selection part. Having 3-4 threads per GPU should be enough, but I'm not sure which workloads benefit from this.
(assuming your project only needs an embarrassingly parallel action)
If you want GPUs to issue commands independently as soon as possible, the device pool and task pool
https://github.com/tugrul512bit/Cekirdekler/wiki/Device-Pool-and-Task-Pool
is nearly as easy as using pipelineEnabled=true. You will need to add the "to be pipelined" devices multiple times into it: 2 entries for a discrete GPU, 1 for an integrated GPU, 1 for the CPU, for example. Some expensive cards can even overlap 3-4 queues, so you can even have 3-4 clones of a discrete GPU to hide PCIe latencies.
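Using only the device-selection and `+` device-addition calls quoted elsewhere in this thread, building such a list of pooled devices might look like the sketch below. The platform/vendor picks and the `cpus(...)` arguments are illustrative assumptions (the second parameter mirrors the `streamOnOff` flag from the core-count example later in the thread); the pool construction itself is documented on the linked wiki page.

```csharp
// Sketch: select the same discrete GPU twice so it contributes two entries
// (two concurrent queues), plus one iGPU entry and one CPU entry.
Hardware.ClPlatforms platforms = Hardware.ClPlatforms.all();

var discrete1 = platforms.platformsAmd().gpus().devicesWithMostComputeUnits()[0];
var discrete2 = platforms.platformsAmd().gpus().devicesWithMostComputeUnits()[0];
var igpu = platforms.platformsIntel().gpus().devicesWithMostComputeUnits()[0];
var cpu = platforms.platformsIntel().cpus(true, false, 4); // flags/core count are guesses

// devices combine with '+'; the discrete GPU appears twice in the result
var pooled = discrete1 + discrete2 + igpu + cpu;
```

The doubled discrete entry is what lets the pool keep two commands in flight on that card at once.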
You can even use constraints in the task pool with proper masks like
TASK_MESSAGE_BROADCAST
TASK_MESSAGE_GLOBAL_SYNCHRONIZATION_FIRST
and similar.
DEVICE_COMPUTE_AT_WILL
will make any device compute any task in the pool as soon as possible. This somewhat depends on CPU performance too if there are too many devices. Its throughput should be very close to that of a pipeline, maybe even higher, because there is no synchronization between devices here. You just need to divide the work into tasks such as
compute 0 to 1023
compute 1024 to 2047
compute 2048 to ....
maybe within a preparation loop.
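Such a preparation loop can be sketched in plain C#; the call that would actually submit each range to the task pool is left as a comment, since only the mask names above are quoted in this thread and the submission API is on the linked wiki page.

```csharp
using System;
using System.Collections.Generic;

const int totalItems = 1024 * 1024; // total work items, illustrative
const int chunk = 1024;             // items per task

var ranges = new List<(int start, int end)>();
for (int start = 0; start < totalItems; start += chunk)
{
    int end = Math.Min(start + chunk, totalItems) - 1;
    ranges.Add((start, end)); // (0,1023), (1024,2047), (2048,3071), ...
}
// each (start, end) pair would then become one task added to the task pool
```

Smaller chunks give the load balancer more to share between devices; bigger chunks mean less per-task overhead, so the chunk size is worth tuning.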
Example of cloning a device:
// selects the first Intel GPU twice and combines the two references
// into 2 virtual GPUs in the devices3 variable
var devices = platforms.platformsIntel().gpus().devicesWithMostComputeUnits()[0];
var devices2 = platforms.platformsIntel().gpus().devicesWithMostComputeUnits()[0];
var devices3 = devices + devices2; // any number of devices can be added like this, not just one
Another advantage of the device/task pool is performance-aware computing: if one device gets stuck for a while, the others can quickly finish the rest of the job. It probably performs better than pipelining, but with slightly more complex code, of course. You may need to balance the number of tasks, probably with your own dynamic code, to get both enough work sharing and low latency.
from cekirdekler.
Edit: this is obsolete; just continue to the next message.
For queue-backed features such as the device/task pool, C#'s Parallel.For performance matters on the host side, so using a newer .NET version could make it faster.
I just noticed I've already upgraded from Parallel.For to dedicated threads in the device pool.
So you get a boost only on
- device-to-device pipeline
- simple compute
It's been a long time since I looked at the code; really sorry.
You should test the simplest case first: multiple discrete GPU clones + 1 CPU + 1 iGPU in the same device object, without pipelining enabled, so it acts like pipelining for the cloned device.
Also, it can take nearly 50 steps to converge to a fair load balance for some workloads and systems.
The iterative load-balancer solver is meant to be more performant than precise :)
Also, if your CPU doesn't have enough cores, you can decrease the number of cores used in "compute" so that 1-2 cores stay free to feed the GPUs.
To select 5 cores for example:
Hardware.ClPlatforms platforms = Hardware.ClPlatforms.all();
platforms.platformsAmd().cpus(true, streamOnOff, 5); // third parameter = number of cores to use
It will try to select 5 cores if available; if not, it picks N-1 by default (or at least that's how I remember it).
I developed this with an FX-8150 + HD 7870 and an N3060 + HD440 (or some other low-end Intel iGPU), and it did help keep the GPUs more in-flight.
When all CPU cores are busy computing a kernel, it really limits the capabilities of every feature, especially when RAM bandwidth is low.
Additionally,
platforms.devicesWithDedicatedMemory()
can be useful for knowing which devices have dedicated memory when cloning them. Devices with dedicated memory most probably also have 1 or 2 DMA engines to copy data to/from RAM. I didn't test this for all dGPUs and special APUs (with DRAM inside the CPU), but I expect it to hold.
Get the dedicated-memory devices, clone them once or twice, and add them onto the full device list so that dedicated-memory devices get two or three copies while the others get only one.
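Put together, and using only the calls quoted in this thread, that cloning scheme might be sketched as follows. It assumes the `platforms` object from the earlier core-selection example; which devices actually exist is system-dependent, so treat the names as illustrative.

```csharp
// Devices owning dedicated memory (likely with 1-2 DMA engines):
// select the group twice so each such device gets two entries.
var dedicated1 = platforms.devicesWithDedicatedMemory();
var dedicated2 = platforms.devicesWithDedicatedMemory();

// remaining devices (e.g. an iGPU) are added only once
var igpu = platforms.platformsIntel().gpus().devicesWithMostComputeUnits()[0];

// dedicated-memory devices end up with two copies, the others with one
var all = dedicated1 + dedicated2 + igpu;
```

The idea is that the extra copies let the DMA engines overlap transfers with compute, while shared-memory devices gain nothing from cloning.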