Comments (5)

tugrul512bit commented on May 24, 2024

Pipelining runs the same set of pipeline waves on every device. It also limits load balancing granularity when enabled. I can't log in to Windows right now (it's bugged), but you can clone the project and add an if/else near the pipelining part to check whether the device is an integrated GPU or a CPU. This is a bit complex because of my writing style (sorry for this, I was learning C# while doing this :) )

Another easy way is to decrease the number of pipelining waves (2 instead of 4, for example) and clone the discrete device object twice. This would allocate more threads to the discrete GPU.

Maybe even skip pipelining and just clone the discrete GPU 3-4 times, without cloning the CPU, in the explicit device selection part. Having 3-4 threads per GPU should be enough, but I'm not sure which workloads benefit from this.


(assuming your project needs just an embarrassingly parallel action)
If you want the GPUs to issue commands independently, as soon as possible, the device pool and task pool

https://github.com/tugrul512bit/Cekirdekler/wiki/Device-Pool-and-Task-Pool

is nearly as easy as using pipelineEnabled=true. You will need to add the "to be pipelined" devices into it multiple times: 2 for a discrete GPU, 1 for an integrated GPU, 1 for the CPU, for example. Some expensive cards can even overlap 3-4 queues, so you can have 3-4 clones of a discrete GPU to hide PCIe latencies.

You can even apply constraints in the task pool using the proper masks, such as

TASK_MESSAGE_BROADCAST

TASK_MESSAGE_GLOBAL_SYNCHRONIZATION_FIRST

and similar.

DEVICE_COMPUTE_AT_WILL

will make any device compute any task in the pool as soon as possible. This also depends somewhat on CPU performance if there are too many devices. Its throughput should be very close to that of a pipeline, maybe even higher, because there is no synchronization between devices here. You just need to divide the work into tasks such as

compute 0 to 1023
compute 1024 to 2047
compute 2048 to ....

maybe within a preparation loop.
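
The preparation loop mentioned above could look roughly like the sketch below. It only computes the (offset, count) pairs for the chunks and does not call Cekirdekler at all; the `TaskSplitter` class and its `Split` method are hypothetical names made up for this illustration, and turning each pair into a task for the pool is left to the wiki page linked earlier.

```csharp
using System;
using System.Collections.Generic;

public static class TaskSplitter
{
    // Hypothetical helper: splits [0, totalItems) into consecutive chunks.
    // Each chunk is an (offset, count) pair, e.g. (0, 1024), (1024, 1024), ...
    public static List<(int Offset, int Count)> Split(int totalItems, int chunkSize)
    {
        if (chunkSize <= 0)
            throw new ArgumentOutOfRangeException(nameof(chunkSize));

        var chunks = new List<(int Offset, int Count)>();
        for (int offset = 0; offset < totalItems; offset += chunkSize)
        {
            // The last chunk may be shorter than chunkSize.
            int count = Math.Min(chunkSize, totalItems - offset);
            chunks.Add((offset, count));
        }
        return chunks;
    }
}
```

For example, `TaskSplitter.Split(4096, 1024)` yields four chunks starting at offsets 0, 1024, 2048 and 3072. Smaller chunks share work more evenly but add per-task latency, so the chunk size is something to tune.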

Example of cloning a device:

    // Picks the first entry of devicesWithMostComputeUnits() twice and combines
    // both handles, so the same GPU appears as 2 virtual GPUs in devices3.
    Hardware.ClPlatforms platforms = Hardware.ClPlatforms.all();
    var devices = platforms.platformsIntel().gpus().devicesWithMostComputeUnits()[0];
    var devices2 = platforms.platformsIntel().gpus().devicesWithMostComputeUnits()[0];
    var devices3 = devices + devices2; // any number of devices can be added like this, not just one

Another advantage of the device-task pool is performance-aware computing: if one device gets stuck for a while, the others can quickly finish the rest of the job. It probably performs better than pipelining, but with somewhat more complex code of course. You may need to balance the number of tasks, probably with your own dynamic code, to get both enough work sharing and low latency.
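
To illustrate why pulling tasks from a shared pool adapts to device speed, here is a minimal simulation that uses plain threads instead of real devices. It is not how Cekirdekler implements its pool; `PoolSketch`, `Run` and the per-task delays are assumptions made up for this sketch. Each worker stands in for one device (or one clone) and takes the next task as soon as it is free.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

public static class PoolSketch
{
    // delayMs[i] is the simulated time worker i needs per task.
    // Returns how many tasks each worker ended up processing.
    public static int[] Run(int taskCount, int[] delayMs)
    {
        var tasks = new ConcurrentQueue<int>();
        for (int i = 0; i < taskCount; i++) tasks.Enqueue(i);

        var done = new int[delayMs.Length];
        var threads = new List<Thread>();
        for (int w = 0; w < delayMs.Length; w++)
        {
            int id = w; // copy the loop variable for the closure
            var thread = new Thread(() =>
            {
                // Pull the next task as soon as this worker is free.
                while (tasks.TryDequeue(out int taskId))
                {
                    Thread.Sleep(delayMs[id]); // stands in for the kernel run
                    done[id]++;                // only this thread writes done[id]
                }
            });
            threads.Add(thread);
            thread.Start();
        }
        foreach (var thread in threads) thread.Join();
        return done;
    }
}
```

Calling `PoolSketch.Run(20, new[] { 1, 5 })` processes all 20 tasks, and the worker with the shorter delay will typically take most of them. Nothing waits on the slow worker except the final task it happens to hold.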

from cekirdekler.

tugrul512bit commented on May 24, 2024

Edit: This is obsolete. Just continue to the next message.

For queue-backed features such as the device-task pool, the Parallel.For performance of C# is important, so using a newer .NET version could make it faster on the host side.


tugrul512bit commented on May 24, 2024

I just noticed I had already upgraded the device pool from Parallel.For to dedicated threads.

So you get a boost only on

  • device to device pipeline
  • simple compute

It's been a long time since I looked at the code, really sorry.

You should test with the simplest case first: multiple discrete GPU clones + 1 CPU + 1 iGPU in the same device object, without pipelining enabled, so that it acts like pipelining for the cloned device.

Also, it can take nearly 50 steps to converge to a fair load balance for some workloads and systems.

The iterative load balancer solver is meant to be fast rather than precise :)
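
As a rough picture of why an iterative balancer needs several dozen steps, here is a generic damped-update sketch. It is not Cekirdekler's actual solver: `BalanceSketch`, the `speeds` values and the `damping` factor are all assumptions chosen for illustration. Each step moves every device's share only part of the way toward the share its measured throughput suggests.

```csharp
using System;
using System.Linq;

public static class BalanceSketch
{
    // speeds[i]: hypothetical work items per millisecond for device i.
    // Returns each device's share of the work after the given number of steps.
    public static double[] Balance(double[] speeds, int steps, double damping)
    {
        int n = speeds.Length;
        // Start from an equal split.
        double[] share = Enumerable.Repeat(1.0 / n, n).ToArray();

        for (int step = 0; step < steps; step++)
        {
            // "Measure" how long each device took for its current share.
            double[] time = new double[n];
            for (int i = 0; i < n; i++) time[i] = share[i] / speeds[i];

            // Observed throughput suggests a target share per device.
            double[] throughput = new double[n];
            for (int i = 0; i < n; i++) throughput[i] = share[i] / time[i];
            double total = throughput.Sum();

            // Move only part of the way toward the target to avoid oscillation.
            for (int i = 0; i < n; i++)
                share[i] += damping * (throughput[i] / total - share[i]);
        }
        return share;
    }
}
```

With speeds of 1 and 3 and a damping of 0.1, the shares start at 0.5 each and approach 0.25 and 0.75. The remaining error shrinks by a factor of 0.9 per step, so it takes on the order of 50 steps to get within about 1% of the fair split. A larger damping converges faster but reacts more strongly to noisy timings.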


tugrul512bit commented on May 24, 2024

Also, if your CPU doesn't have enough cores, you can decrease the number of cores used in "compute" so that 1-2 cores stay free to feed the GPUs.

To select 5 cores for example:

    Hardware.ClPlatforms platforms = Hardware.ClPlatforms.all();
    platforms.platformsAmd().cpus(true,streamOnOff,5);

It will try to select 5 cores if that many are available. If not, it will pick N-1 by default (as far as I remember).

I developed this with an FX-8150 + HD 7870 and an N3060 + HD440 (or some other low-end Intel iGPU), and it did help keep more GPU work in flight.

When all CPU cores are busy computing a kernel, it really limits the capabilities of any feature, especially when RAM bandwidth is low.


tugrul512bit commented on May 24, 2024

In addition,

platforms.devicesWithDedicatedMemory()

can be useful for finding out which devices have dedicated memory when cloning them. Those with dedicated memory most probably also have 1 or 2 DMA engines to copy data to/from RAM. I didn't test this on all dGPUs and special APUs (with DRAM inside the CPU), but I expect it to work there too.

Get the dedicated ones, clone them once or twice, and add them to the full device list so that dedicated-memory devices have two or three copies while the others have only 1.

