
Extremely slow reduction (oneapi.jl issue, open)

juliagpu avatar juliagpu commented on June 1, 2024
Extremely slow reduction

from oneapi.jl.

Comments (7)

maleadt avatar maleadt commented on June 1, 2024

Definitely the kernel:

image


maleadt avatar maleadt commented on June 1, 2024

Occupancy of the main kernel is good at 95%, EUs are 93% active, so I'm wondering if I'm doing something fundamentally wrong here.

image


freemin7 avatar freemin7 commented on June 1, 2024

So something improved since then:

julia> @benchmark sum(da)
BenchmarkTools.Trial: 106 samples with 1 evaluation.
 Range (min … max):  47.019 ms …  47.963 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     47.258 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   47.297 ms ± 166.846 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

        ▁ ▁  █▄▁▃ ▃▁▁ ▃  ▃▄       ▃                             
  ▆▁▁▄▄▁█▄█▆▆████▇███▁█▇▇██▄▆▇▄▁▁▆█▄▄▄▁▆▁▄▄▁▄▁▁▁▁▁▁▄▁▁▆▄▁▁▁▁▁▄ ▄
  47 ms           Histogram: frequency by time         47.8 ms <

 Memory estimate: 29.81 KiB, allocs estimate: 515.

On the same hardware: ZeDevice(GPU, vendor 0x8086, device 0x3e96): Intel(R) UHD Graphics P630 [0x3e96]

Some uneducated guesses as to what might be going wrong:

  • # load the neutral value
    Iout = CartesianIndex(Tuple(Iother)..., groupIdx_reduce)
    neutral = if neutral === nothing
        R[Iout]
    else
        neutral
    end
    val = op(neutral, neutral)

    I interpret this as every work-item computing the neutral element for itself. There is indeed a fadd %x %x in the kernel related to that branch.

  • Every length(Rother) seems to involve two memory accesses and a multiplication. Julia doesn't seem to exploit the fact that length(Rother) is constant in map reductions. Calculating length(Rother) et al. on the CPU and passing them in might help.

  • I am probably wrong, but I feel like there is a barrier missing here.
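
To make the three guesses above concrete, here is a minimal CPU-only sketch of a single work-group's tree reduction. It is not the oneAPI.jl or CUDA.jl kernel: `group_reduce`, `items`, and `shared` are hypothetical names, and the "work-items" are emulated with ordinary loops. It only illustrates where the neutral value enters, where a length can be computed once up front, and where a GPU version would need a work-group barrier.

```julia
# CPU-only emulation of one work-group's tree reduction (illustration only).
# `group_reduce`, `items` and `shared` are hypothetical names, not oneAPI.jl API.
function group_reduce(op, neutral, data::AbstractVector, items::Int)
    @assert ispow2(items) "the tree step below assumes a power-of-two group size"
    n = length(data)               # computed once, outside the per-item loops
    shared = fill(neutral, items)  # stands in for work-group local memory

    # Phase 1: each emulated work-item folds a strided slice of the input,
    # starting directly from the neutral value that was passed in.
    for item in 1:items
        val = neutral
        i = item
        while i <= n
            val = op(val, data[i])
            i += items
        end
        shared[item] = val
    end

    # Phase 2: pairwise tree reduction over the local buffer.
    d = items ÷ 2
    while d >= 1
        # On a GPU, a work-group barrier belongs here so that every write from
        # the previous step is visible before any work-item reads it.
        for item in 1:d
            shared[item] = op(shared[item], shared[item + d])
        end
        d ÷= 2
    end
    return shared[1]
end

a = collect(1.0f0:1024.0f0)
result = group_reduce(+, 0.0f0, a, 64)
```

In this emulation the steps run sequentially, so no barrier is needed; on real hardware, omitting it between tree steps would be a correctness problem rather than a performance one.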


maleadt avatar maleadt commented on June 1, 2024

Those are valid concerns, but I doubt that they are responsible for the huge slowdown. The reduce implementation is taken from CUDA.jl, where it performs well.


freemin7 avatar freemin7 commented on June 1, 2024

I will poke at it a bit.
There are differences between CUDA and SYCL, though (https://sycl.tech/assets/files/Michel_Migdal_Codeplay_Porting_Tips_CDUA_To_SYCL.pdf); some of them sound relevant.


maleadt avatar maleadt commented on June 1, 2024

That looks like an interesting document. I haven't had the time yet to optimize oneAPI.jl, only focusing on features right now, so it would be great if you would have the time to take a look :-) Let me know if there's anything I can help with.


maleadt avatar maleadt commented on June 1, 2024

FWIW, on an A770 (vs a 5950X):

julia> @benchmark sum($a)
BenchmarkTools.Trial: 2677 samples with 1 evaluation.
 Range (min … max):  1.794 ms …  2.207 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.878 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.862 ms ± 37.677 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                               ▃▂█▇▄▃
  ▃▄▅▆▆▇▇▆▅▅▄▄▃▃▂▂▂▂▂▂▂▂▂▁▂▂▂▃▇███████▆▅▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂ ▃
  1.79 ms        Histogram: frequency by time        1.95 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sum($d_a)
BenchmarkTools.Trial: 1105 samples with 1 evaluation.
 Range (min … max):  1.718 ms … 11.036 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.735 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.527 ms ±  2.293 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▃▂               ▁ ▁
  ███▃▁▁▂▁▁▁▁▁▁▁▁▁▁▁█▆█▇█▇▃▅▄▃▃▃▂▂▂▂▄▄▃▄▄▄▃▂▂▁▂▁▁▁▂▂▁▁▃▃▃▄▃▂ ▃
  1.72 ms        Histogram: frequency by time        10.2 ms <

 Memory estimate: 31.45 KiB, allocs estimate: 588.

So still way too slow, but at least not outperformed by the CPU...
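
As a rough worked comparison, the two trials in this comment can be put side by side. The numbers below are copied from the BenchmarkTools output above (in milliseconds); the variable names are only for illustration.

```julia
# Timings in ms, copied from the two @benchmark outputs in this comment.
cpu_min, cpu_median = 1.794, 1.878   # sum($a)   on the 5950X
gpu_min, gpu_median = 1.718, 4.735   # sum($d_a) on the A770

# Ratio > 1 means the GPU reduction took longer than the CPU one.
min_ratio    = gpu_min / cpu_min
median_ratio = gpu_median / cpu_median

println("GPU/CPU ratio at the minimum: ", round(min_ratio, digits = 2))
println("GPU/CPU ratio at the median:  ", round(median_ratio, digits = 2))
```

The best GPU sample is on par with the CPU, while the median is considerably slower and the spread (1.718 ms to 11.036 ms) is much wider than on the CPU.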

