
omlins / ParallelStencil.jl


Package for writing high-level code for parallel high-performance stencil computations that can be deployed on both GPUs and CPUs

License: BSD 3-Clause "New" or "Revised" License

Julia 100.00%
julia gpu stencil-codes parallel multi-gpu staggered-grids cuda multi-xpu xpu

ParallelStencil.jl's Issues

Caching the GPU kernels?

After starting Julia, the first use / compilation of the GPU kernels takes about 30 s for me. This happens every time I restart Julia.
Is there a possibility that this is cached?
Since Julia 1.9, it seems to me that most compilation is cached; it would be great if this were possible for this package as well.

Efficient cache-aware and cache-oblivious implementations

There's a great example, diffusion2D_shmem_novis.jl, which shows how to leverage the shared memory of a GPU (thereby taking advantage of the spatial locality common to so many stencil codes).

It would be great if we could abstract this slightly such that we could implement cache-aware algorithms on the CPU using the same kernel code as the GPU. In essence, one divides the domain into chunks (or blocks, on a GPU) and then loops over these chunks, assigning workers as needed. Unfortunately, one doesn't have direct access to the L1 cache within a CPU (otherwise it wouldn't be called a cache). However, it's well known that by tiling the arrays such that they fit into cache, the data will stay in cache, allowing you to leverage spatial locality (as is often done with CPU-based tiled matrix multiplication).

Currently, the CPU loops over individual indices... so even if someone manually chunked the domain, there would be quite a lot of cache contention. Instead, it would be better to loop over tiles in this case. (There's a bit of extra work in ensuring that the same kernel code could be used on a CPU and a GPU because of the extra worker hierarchy within GPUs.)
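As an illustration of the tiling idea (a plain-Julia sketch, not ParallelStencil code; the tile size is a hypothetical tuning parameter):

# 2D diffusion-like update that visits the interior tile by tile, so that each
# tile's working set can stay resident in cache.
function diffusion_step_tiled!(T2, T, a; tile=32)
    nx, ny = size(T)
    for jb = 2:tile:ny-1, ib = 2:tile:nx-1                            # loop over tiles ("blocks")
        for j = jb:min(jb+tile-1, ny-1), i = ib:min(ib+tile-1, nx-1)  # loop within one tile
            T2[i, j] = T[i, j] + a*(T[i-1, j] + T[i+1, j] + T[i, j-1] + T[i, j+1] - 4T[i, j])
        end
    end
    return nothing
end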

Even better would be an abstract interface to a cache-oblivious implementation that "just works" on CPUs and GPUs (i.e. a recursive technique could be used to automatically detect cache sizes). Cache-oblivious algorithms are great for a number of reasons besides portability. Namely, they tend to leverage all levels of the cache (L2, L3, etc) better than cache-aware algorithms.

I realize this is a somewhat grandiose vision... I'm just throwing it out there for people to toy with 🙂

AMDGPU v0.5.0 compat

Since AMDGPU v0.5.0, gridsize represents the number of "workgroups" (blocks in CUDA) and no longer "workitems * workgroups" (threads * blocks in CUDA), as HIP is used for kernel launches instead of HSA.

This means that the previous AMDGPU kernel launch parameter gridsize needs to be adapted from gridsize = nx to gridsize = cld(nx, groupsize).
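A minimal sketch of the change, using the names from the issue text (groupsize being the workgroup size and nx the total number of workitems):

groupsize = 256                 # workgroup ("block") size
gridsize  = nx                  # before AMDGPU v0.5.0 (HSA): total number of workitems
gridsize  = cld(nx, groupsize)  # since AMDGPU v0.5.0 (HIP): number of workgroups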

Also, queues and signals are now abstracted into streams.

`@hide_communication` error when inside a module and using `CUDA.jl` backend

Currently, the @hide_communication macro does not work when using it within a module, e.g. in the context of a package that depends on ParallelStencil (as done for example in AcousticWaveCPML.jl).

Error description

The following error pops up during precompilation of the package that depends on ParallelStencil:

...
ERROR: LoadError: Evaluation into the closed module `ParallelKernel` breaks incremental compilation because the side effects will not be permanent. This is likely due to some other module mutating `ParallelKernel` with `eval` during precompilation - don't do this.
Stacktrace:
  [1] eval
    @ ./boot.jl:368 [inlined]
  [2] eval
    @ ~/.julia/packages/ParallelStencil/3flwf/src/ParallelKernel/ParallelKernel.jl:37 [inlined]
  [3] extract_args(call::Expr, macroname::Symbol)
    @ ParallelStencil.ParallelKernel ~/.julia/packages/ParallelStencil/3flwf/src/ParallelKernel/shared.jl:100
  [4] hide_communication_cuda(boundary_width::Symbol, block::Expr)
    @ ParallelStencil.ParallelKernel ~/.julia/packages/ParallelStencil/3flwf/src/ParallelKernel/hide_communication.jl:118
  [5] hide_communication(::Symbol, ::Vararg{Union{Expr, Integer, Symbol}}; package::Symbol)
    @ ParallelStencil.ParallelKernel ~/.julia/packages/ParallelStencil/3flwf/src/ParallelKernel/hide_communication.jl:82
  [6] hide_communication(::Symbol, ::Vararg{Union{Expr, Integer, Symbol}})
    @ ParallelStencil.ParallelKernel ~/.julia/packages/ParallelStencil/3flwf/src/ParallelKernel/hide_communication.jl:81
  [7] var"@hide_communication"(__source__::LineNumberNode, __module__::Module, args::Vararg{Any})
    @ ParallelStencil.ParallelKernel ~/.julia/packages/ParallelStencil/3flwf/src/ParallelKernel/hide_communication.jl:68
...

This error only shows up when using @hide_communication inside a module where ParallelStencil was initialized with the CUDA.jl backend.

The cause of this error is the eval call inside the extract_args function that gets called inside the hide_communication_cuda function, currently implemented as:

extract_args(call::Expr, macroname::Symbol) = eval(substitute(deepcopy(call), macroname, Symbol("@get_args")))

This function is only called in the context of hide communication and the macroname is always Symbol("@parallel").

What this function does is replace the @parallel macro of the kernel call expression with the @get_args macro, which is defined in the same file as extract_args as follows:

macro get_args(args...) return args end

and then calls eval to evaluate the new macro call, thereby obtaining a tuple with the arguments of the @parallel kernel call.
This eval call is problematic since it could change the ParallelStencil.ParallelKernel module from a third-party module while precompilation is occurring. Of course, no actual harm is done here, but Julia is (apparently?) not smart enough to notice this.

How to reproduce the error

Either:

  • clone AcousticWaveCPML.jl. Edit a multi-xPU shared solver (e.g. src/shared/acoustic_2D_multixPU.jl) and uncomment the @hide_communication calls (e.g. lines 160 and 172 in src/shared/acoustic_2D_multixPU.jl). Try to precompile the package and you should get the error.
  • write a package with a module that uses ParallelStencil inside of it and initialize ParallelStencil with the CUDA.jl backend. Write a @parallel kernel function and use @hide_communication to call that kernel function from within the module. Try to precompile the package and you should get the error.

Possible fixes

A possible fix is to call eval specifying the caller module to be the module where the @hide_communication macro is originally used. This does, however, require the @get_args macro to be accessible in the caller module, which it is not in this case, because it resides in the ParallelStencil.ParallelKernel module.

Another fix is to just replace the extract_args function with the following:

extract_args(call::Expr) = tuple(call.args[3:end]...)

Since we know that the only possible expression in call will be a @parallel kernel call, we can extract the arguments of the call just like that, without having to rely on eval. Maybe it would be nice to extract these arguments somewhere else (e.g. in the extract_calls function) where it is explicitly checked that the expression is a @parallel kernel call.
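For illustration, a minimal example of what the eval-free extraction returns (the kernel name and arguments are made up):

# A macrocall Expr stores the macro name in args[1], a LineNumberNode in args[2],
# and the macro's own arguments from args[3] onwards.
expr = :(@parallel diffusion3D_step!(T2, T, Ci, lam, dx, dy, dz))
tuple(expr.args[3:end]...)   # -> (:(diffusion3D_step!(T2, T, Ci, lam, dx, dy, dz)),)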

This latter fix has been tested against the Base.Threads and CUDA.jl backend unit tests currently available in the package.

Add device_sync

With JuliaGPU switching to task-local state, synchronize only syncs the default stream of the current task. This may lead to conflicts between array-programming operations executed on the default stream and kernel-programming operations executed on a custom stream. Adding support for a (heavy) device sync may be needed in some cases:
CUDA: CUDA.device_synchronize()
AMDGPU: AMDGPU.HIP.device_synchronize()
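A minimal sketch of such a heavy sync (the function names are those listed above; availability depends on the CUDA.jl / AMDGPU.jl versions, and backend is a placeholder for however the code selects its backend):

# Synchronize the whole device instead of only the task-local default stream.
if backend == :CUDA
    CUDA.device_synchronize()
elseif backend == :AMDGPU
    AMDGPU.HIP.device_synchronize()
end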

sync issues on AMDGPU backend

Running large kernels on the AMDGPU backend seems to hit synchronization issues. The behaviour is similar to #107 (comment).

The issue is that the AMDGPU.HIPStream constructor is used instead of getting the default stream for the current task via AMDGPU.stream().
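A minimal sketch of the difference (AMDGPU.jl API as described above):

stream = AMDGPU.HIPStream()   # problematic: a fresh stream, detached from the array operations
stream = AMDGPU.stream()      # fix: the default stream bound to the current task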

Issues with backpropagation and adjoint construction

I couldn't get either method for adjoint generation working over the @parallel stencil. For pure Zygote-based VJP calculations,

@parallel function diffusion3D_step!(T2, T, Ci, lam, dx, dy, dz)
    @inn(T2) = lam*@inn(Ci)*(@d2_xi(T)/dx^2 + @d2_yi(T)/dy^2 + @d2_zi(T)/dz^2);
    return
end

will fail because of the mutation. I was wondering if you were planning to support adjoint rules on @parallel constructors for this. That would be required for GPU usage.

Trying to avoid that issue, using ReverseDiff VJPs hit an issue highlighted by the MWE:

const USE_GPU = false
using ParallelStencil, OrdinaryDiffEq
using ParallelStencil.FiniteDifferences3D
@static if USE_GPU
    @init_parallel_stencil(CUDA, Float64, 3);
else
    @init_parallel_stencil(Threads, Float64, 3);
end

@parallel function diffusion3D_step!(T2, T, Ci, lam, dx, dy, dz)
    @inn(T2) = lam*@inn(Ci)*(@d2_xi(T)/dx^2 + @d2_yi(T)/dy^2 + @d2_zi(T)/dz^2);
    return
end

function diffusion3D(lam,alg)
    # Physics
    cp_min     = 1.0;                                        # Minimal heat capacity
    lx, ly, lz = 10.0, 10.0, 10.0;                           # Length of domain in dimensions x, y and z.

    # Numerics
    nx, ny, nz = 16, 16, 16;                              # Number of gridpoints dimensions x, y and z.
    nt         = 100;                                        # Number of time steps
    dx         = lx/(nx-1);                                  # Space step in x-dimension
    dy         = ly/(ny-1);                                  # Space step in y-dimension
    dz         = lz/(nz-1);                                  # Space step in z-dimension

    # Array initializations
    T   = @zeros(nx, ny, nz);
    T2  = @zeros(nx, ny, nz);
    Ci  = @zeros(nx, ny, nz);

    # Initial conditions (heat capacity and temperature with two Gaussian anomalies each)
    Ci .= 1.0./( cp_min .+ Data.Array([5*exp(-(((ix-1)*dx-lx/1.5))^2-(((iy-1)*dy-ly/2))^2-(((iz-1)*dz-lz/1.5))^2) +
                                       5*exp(-(((ix-1)*dx-lx/3.0))^2-(((iy-1)*dy-ly/2))^2-(((iz-1)*dz-lz/1.5))^2) for ix=1:size(T,1), iy=1:size(T,2), iz=1:size(T,3)]) )
    T  .= Data.Array([100*exp(-(((ix-1)*dx-lx/2)/2)^2-(((iy-1)*dy-ly/2)/2)^2-(((iz-1)*dz-lz/3.0)/2)^2) +
                       50*exp(-(((ix-1)*dx-lx/2)/2)^2-(((iy-1)*dy-ly/2)/2)^2-(((iz-1)*dz-lz/1.5)/2)^2) for ix=1:size(T,1), iy=1:size(T,2), iz=1:size(T,3)])
    T2 .= T;                                                 # Assign also T2 to get correct boundary conditions.

    dt = min(dx^2,dy^2,dz^2)*cp_min/8.1;                 # Time step for the 3D Heat diffusion
    function f(du,u,p,t)
        @show t
        @parallel diffusion3D_step!(du, u, Ci, p[1], dx, dy, dz);
    end
    prob = ODEProblem(f, T, (0.0,nt*dt), lam)
    sol = solve(prob, alg, save_everystep = false, save_start = false)
end

sol = diffusion3D([1.0],ROCK2())

using ForwardDiff, Zygote, DiffEqSensitivity
function loss(p)
    sum(diffusion3D(p,ROCK2()))
end
ForwardDiff.gradient(loss,[1.0])
Zygote.gradient(loss,[1.0])
TaskFailedException

    nested task error: BoundsError: attempt to access 22080-element Vector{ReverseDiff.AbstractInstruction} at index [474]
    Stacktrace:
      [1] setindex!
        @ .\array.jl:839 [inlined]
      [2] push!
        @ .\array.jl:930 [inlined]
      [3] record!
        @ C:\Users\accou\.julia\packages\ReverseDiff\E4Tzn\src\tape.jl:10 [inlined]
      [4] ForwardOptimize
        @ C:\Users\accou\.julia\packages\ReverseDiff\E4Tzn\src\macros.jl:105 [inlined]
      [5] -
        @ C:\Users\accou\.julia\packages\ReverseDiff\E4Tzn\src\derivatives\scalars.jl:9 [inlined]
      [6] macro expansion
        @ C:\Users\accou\.julia\packages\ParallelStencil\0VULM\src\parallel.jl:103 [inlined]
      [7] macro expansion
        @ D:\OneDrive\Computer\Desktop\test.jl:11 [inlined]
      [8] macro expansion
        @ C:\Users\accou\.julia\packages\ParallelStencil\0VULM\src\ParallelKernel\parallel.jl:330 [inlined]
      [9] (::var"#567#threadsfor_fun#7"{Array{ReverseDiff.TrackedReal{Float64, Float64, ReverseDiff.TrackedArray{Float64, Float64, 3, Array{Float64, 3}, Array{Float64, 3}}}, 3}, ReverseDiff.TrackedArray{Float64, Float64, 3, Array{Float64, 3}, Array{Float64, 3}}, Array{Float64, 3}, ReverseDiff.TrackedReal{Float64, Float64, ReverseDiff.TrackedArray{Float64, Float64, 1, Vector{Float64}, Vector{Float64}}}, Float64, Float64, Float64, Tuple{UnitRange{Int64}, UnitRange{Int64}, UnitRange{Int64}}, UnitRange{Int64}})(onethread::Bool)
        @ Main .\threadingconstructs.jl:81
     [10] (::var"#567#threadsfor_fun#7"{Array{ReverseDiff.TrackedReal{Float64, Float64, ReverseDiff.TrackedArray{Float64, Float64, 3, Array{Float64, 3}, Array{Float64, 3}}}, 3}, ReverseDiff.TrackedArray{Float64, Float64, 3, Array{Float64, 3}, Array{Float64, 3}}, Array{Float64, 3}, ReverseDiff.TrackedReal{Float64, Float64, ReverseDiff.TrackedArray{Float64, Float64, 1, Vector{Float64}, Vector{Float64}}}, Float64, Float64, Float64, Tuple{UnitRange{Int64}, UnitRange{Int64}, UnitRange{Int64}}, UnitRange{Int64}})()
        @ Main .\threadingconstructs.jl:48
wait at task.jl:322 [inlined]
threading_run(func::Function) at threadingconstructs.jl:34
macro expansion at threadingconstructs.jl:93 [inlined]
macro expansion at parallel.jl:328 [inlined]
diffusion3D_step! at shared.jl:77 [inlined]
(::var"#f#12"{Array{Float64, 3}, Float64, Float64, Float64})(du::Array{ReverseDiff.TrackedReal{Float64, Float64, ReverseDiff.TrackedArray{Float64, Float64, 3, Array{Float64, 3}, Array{Float64, 3}}}, 3}, u::ReverseDiff.TrackedArray{Float64, Float64, 3, Array{Float64, 3}, Array{Float64, 3}}, p::ReverseDiff.TrackedArray{Float64, Float64, 1, Vector{Float64}, Vector{Float64}}, t::ReverseDiff.TrackedReal{Float64, Float64, ReverseDiff.TrackedArray{Float64, Float64, 1, Vector{Float64}, Vector{Float64}}}) at test.jl:42
ODEFunction at scimlfunctions.jl:334 [inlined]
(::DiffEqSensitivity.var"#33#37"{ODEFunction{true, var"#f#12"{Array{Float64, 3}, Float64, Float64, Float64}, LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}})(u::ReverseDiff.TrackedArray{Float64, Float64, 3, Array{Float64, 3}, Array{Float64, 3}}, p::ReverseDiff.TrackedArray{Float64, Float64, 1, Vector{Float64}, Vector{Float64}}, t::ReverseDiff.TrackedArray{Float64, Float64, 1, Vector{Float64}, Vector{Float64}}) at derivative_wrappers.jl:345
ReverseDiff.GradientTape(f::Function, input::Tuple{Array{Float64, 3}, Vector{Float64}, Vector{Float64}}, cfg::ReverseDiff.GradientConfig{Tuple{ReverseDiff.TrackedArray{Float64, Float64, 3, Array{Float64, 3}, Array{Float64, 3}}, ReverseDiff.TrackedArray{Float64, Float64, 1, Vector{Float64}, Vector{Float64}}, ReverseDiff.TrackedArray{Float64, Float64, 1, Vector{Float64}, Vector{Float64}}}}) at tape.jl:207
GradientTape at tape.jl:204 [inlined]
_vecjacobian!(dλ::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, y::Array{Float64, 3}, λ::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, p::Vector{Float64}, t::Float64, S::DiffEqSensitivity.ODEInterpolatingAdjointSensitivityFunction{DiffEqSensitivity.AdjointDiffCache{Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, ReverseDiff.GradientTape{DiffEqSensitivity.var"#93#104"{ODEFunction{true, var"#f#12"{Array{Float64, 3}, Float64, Float64, Float64}, LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}}, Tuple{ReverseDiff.TrackedArray{Float64, Float64, 3, Array{Float64, 3}, Array{Float64, 3}}, ReverseDiff.TrackedArray{Float64, Float64, 1, Vector{Float64}, Vector{Float64}}, ReverseDiff.TrackedArray{Float64, Float64, 1, Vector{Float64}, Vector{Float64}}}, Vector{ReverseDiff.TrackedReal{Float64, Float64, ReverseDiff.TrackedArray{Float64, Float64, 1, Vector{Float64}, Vector{Float64}}}}}, Nothing, Noth...

While forward mode runs, it looks like the tracing required for adjoints causes issues with the threading constructs.

[New Miniapp] Time-varying pressure source for acoustic2D

Not sure if you want to avoid domain-specific miniapps, but I am working towards modifying the acoustic 2D miniapp for ultrasound and photoacoustic modeling. Currently I have just played around with getting a time-varying source working.

My current implementation just sets the pressure at a given pixel based on some source signal vector.

freq = 1/dt/5 # divide by 5 to stay below the Nyquist limit of the grid's sampling frequency
trange = range(0,2pi,length=nt)
amp = 1
signal = amp*sin.(freq*trange)

# Time loop
for it = 1:nt
    P[nx÷2,ny÷2] = signal[it] # Set source signal value
    @parallel compute_V!(Vx, Vy, P, dt, ρ, dx, dy)
    @parallel compute_P!(P, Vx, Vy, dt, k, dx, dy)

    # Visualisation
end

I can clean my code up a bit and submit it as an acoustic2D miniapp with a time-varying source, but I wanted to check first: is this a recommended way of implementing a time-varying source, or does it pose obvious performance penalties that users shouldn't be exposed to?
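As a hedged side note (my assumption, not a recommendation from the package authors): on the GPU backend, the single-element assignment above relies on scalar indexing of a device array; one way to keep it array-based, at the cost of a tiny kernel launch per step, would be

P[nx÷2:nx÷2, ny÷2:ny÷2] .= signal[it]   # broadcast onto a 1x1 view instead of scalar indexing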

Heterogeneous stencil formulations

TL;DR: can PS run multiple distinct kernels on different SMs within a hide_communication block?

One of the biggest advantages to finite-difference-based codes is the ability to add a wide variety of "physics" into the domain by spatially modifying the underlying stencil. For example, different materials may contain additional PDE terms (i.e. additional storage requirements) that are only needed at certain regions in the grid.

Computationally, this means we can define various kernels across the grid that handle the local effects.

Of course, we could just opt to use the most general kernel across the whole grid, but this significantly reduces the potential performance of the code.

Naively, we might just write a series of case statements within our global looping kernel to handle the heterogeneous effects. However, extensive branching will severely hinder the performance of the code. Furthermore, it's difficult to handle local memory effects. In other words, we don't want to define large (and computationally dense) arrays whose data is actually quite sparse.

Instead, it's common practice to "chunk" the domain, such that each chunk uses the same stencil (i.e. the same underlying kernel). This way, arrays can be assigned to each chunk, rather than on the global domain.

In addition, GPUs can process different kernels simultaneously (e.g. using multiple SMs). Currently, PS uses this approach to handle the hide_communication feature. In some cases, it even updates the boundaries of the domain within the same timeframe (before a sync is performed).

In the ETHZ course, there is a small section describing a current limitation with this approach:

[screenshot from the ETHZ course material]

Is it possible to apply other, more complicated kernels here?

3d image application

Hello, I am trying to apply this library to 3D medical imaging, for which it seems ideal, yet I have problems applying the code to my domain.
To begin with, I would like to achieve a very simple thing: just calculate the mean and standard deviation of the neighbouring pixels, so the code would look something like the following.

# dimensionality increases: the input volume holds just intensities, while the
# output volume holds the mean and standard deviation for each voxel
using Statistics  # mean, std

function getMeanAndStd(data::Array{Int32,3})::Array{Float64,4}
    nx, ny, nz = size(data)
    out = zeros(Float64, nx, ny, nz, 2)   # preallocating output
    r = 1   # neighbourhood radius; a simple cubic stencil around the centre Cartesian index
            # (a von Neumann-style stencil would also work)
    for iz = 1+r:nz-r, iy = 1+r:ny-r, ix = 1+r:nx-r
        nsten = data[ix-r:ix+r, iy-r:iy+r, iz-r:iz+r]
        flattened = vec(nsten)
        out[ix, iy, iz, 1] = mean(flattened)   # putting the result in the proper place of the output
        out[ix, iy, iz, 2] = std(flattened)
    end
    return out
end

Of course, in the case of big images I would need to batch the data, as I could have a problem fitting it in GPU memory; also, as seen from the output here, further analysis would be in 4D.

I just wanted to ask for some guidance on how to achieve what I have written above, so that hopefully I will be able to go further on my own and later share the results of this work with the scientific community.

Thanks for Help!

Support for Complex arrays

Fantastic module.

Will there be (or is there already, somewhere hidden?) support for complex-typed arrays? I would love to use this to perform simulations of the Nonlinear Schrödinger Equation (specifically Gross-Pitaevskii), but that isn't possible without this feature.

typo in error message

@IncoherentCallError("ParallelStencil has already been initialized, with different arguments. If you are using ParallelStencil interactively in the REPL and want to avoid restarting Julia, then you can call ParalllelStencil.@reset_parallel_stencil() and rerun all parts of your code that use ParallelStencil features (including kernel definitions and array allocations). If you are using ParallelStencil non-interactively, then you are using ParallelStencil in an invalid way: @init_parallel_stencil should only be called once, right after 'using ParalllelStencil'.")

ParalllelStencil.@reset_parallel_stencil() should be replaced with ParallelStencil.@reset_parallel_stencil() (ParallelStencil is misspelled with three l's).

Make CellArrays mutable

I found the documentation on the @CellType macro to be quite interesting and potentially useful for me.

For example, if I am trying to solve a system of tensor equations, then having this ability to store the elements of a symmetric tensor on the grid like the given example @CellType SymmetricTensor2D fieldnames=(xx, zz, xz) would be useful.

But I found that (as far as I can tell from the docs and my testing) there is no way to make the cell type mutable, which is necessary when the tensor makes up the state vector of my system of equations, and thus must be modified at each time step. Since this ultimately comes from StaticArrays.jl, there are mutable types available in there, so this functionality could be added.
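For reference, a minimal illustration of the immutable versus mutable fixed-size types StaticArrays.jl provides (independent of ParallelStencil's CellArrays):

using StaticArrays

s = SVector(1.0, 2.0, 3.0)   # immutable: `s[1] = 0.0` throws
m = MVector(1.0, 2.0, 3.0)   # mutable counterpart
m[1] = 0.0                   # works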

Also, it would be nice to have the CellType act as the tensor it is meant to represent, while still only storing the needed values. For example, after defining @CellType SymmetricTensor2D fieldnames=(xx, zz, xz), I could then overload getindex:

@inline function Base.getindex(A::SymmetricTensor2D,μ::Int64,ν::Int64)
    (μ,ν) == (1,1) && return A.xx
    (μ,ν) == (2,2) && return A.zz
    (μ,ν) == (1,2) && return A.xz
    (μ,ν) == (2,1) && return A.xz
end

I don't know if this is the best way to go about this, although it does work. It might be nice to be able to define an index mapping in @CellType so that general tensors can index however they are needed to.

OOM error when running with more than one mpi rank

Hello,

Thank you for a great package! I am trying one of your examples, and it runs fine with one MPI process (never runs out of memory), but it fails with an OOM error with more MPI processes (I checked, the devices are less than 1% memory occupied).

With one process, success:

$JULIA_DEPOT_PATH/bin/mpiexecjl -n 1 julia --project=@. ./ParallelStencil/miwpY/miniapps/acoustic_waves_multixpu/acoustic3D_multixpu.jl
Global grid: 127x127x127 (nprocs: 1, dims: 1x1x1)
Animation directory: ./viz3D_out/
Total steps=1000, time=1.636e+01 sec (@ T_eff = 7.90 GB/s)

With 2 processes, failure:

$JULIA_DEPOT_PATH/bin/mpiexecjl -n 2 julia --project=@. ./ParallelStencil/miwpY/miniapps/acoustic_waves_multixpu/acoustic3D_multixpu.jl
Global grid: 252x127x127 (nprocs: 2, dims: 2x1x1)
ERROR: LoadError: Out of GPU memory trying to allocate 15.628 MiB
Effective GPU memory usage: 1.03% (416.000 MiB/39.586 GiB)
CUDA allocator usage: 0 bytes
Memory pool usage: 0 bytes (0 bytes allocated, 0 bytes cached)
Stacktrace:
 [1] #alloc#202
   @ ./CUDA/mVgLI/src/pool.jl:266 [inlined]
 [2] alloc
   @ ./CUDA/mVgLI/src/pool.jl:258 [inlined]
 [3] CUDA.CuArray{Float64, 3}(#unused#::UndefInitializer, dims::Tuple{Int64, Int64, Int64})
   @ ./CUDA/mVgLI/src/array.jl:28
 [4] CuArray
   @ ./CUDA/mVgLI/src/array.jl:109 [inlined]
 [5] CuArray
   @ ./CUDA/mVgLI/src/array.jl:110 [inlined]
 [6] zeros
   @ ./CUDA/mVgLI/src/array.jl:409 [inlined]
 [7] acoustic3D()
   @ Main ./ParallelStencil/miwpY/miniapps/acoustic_waves_multixpu/acoustic3D_multixpu.jl:40
 [8] top-level scope
   @ ./ParallelStencil/miwpY/miniapps/acoustic_waves_multixpu/acoustic3D_multixpu.jl:86

I appreciate any hints you may have about why this is happening.

Cheers!

Error copying Cuda Array to Array when using GPU

Hi!
First, I'd like to thank you all for this amazing module that I just discovered and enjoy so much.
I am very new to Julia and started to play around with some of the easy GPU computing that ParallelStencil.jl offers, but came across an error when running one of the provided examples (examples/diffusion3D_multigpucpu_hidecomm.jl):

    T_nohalo .= T[2:end-1,2:end-1,2:end-1];                                           # Copy data to CPU removing the halo.

where T is a CUDA array (selected by ParallelStencil) and T_nohalo a "standard" Array,

causing the following error:

ERROR: LoadError: This object is not a GPU array
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:33
[2] backend(#unused#::Type)
@ GPUArrays ~/.julia/packages/GPUArrays/VNhDf/src/device/execution.jl:15
[3] backend(x::Array{Float64, 3})
@ GPUArrays ~/.julia/packages/GPUArrays/VNhDf/src/device/execution.jl:16
[4] _copyto!
@ ~/.julia/packages/GPUArrays/VNhDf/src/host/broadcast.jl:73 [inlined]
[5] materialize!
@ ~/.julia/packages/GPUArrays/VNhDf/src/host/broadcast.jl:51 [inlined]
[6] materialize!(dest::Array{Float64, 3}, bc::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{3}, Nothing, typeof(identity), Tuple{CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}}})
@ Base.Broadcast ./broadcast.jl:868
[7] diffusion3D()
@ Main /.../diffusion3D_multigpucpu_hidecomm.j:60
[8] top-level scope
@ /.../diffusion3D_multigpucpu_hidecomm.jl:84
in expression starting at /.../diffusion3D_multigpucpu_hidecomm.j:84

So from my basic understanding, it would appear that ParallelStencil doesn't allow interoperability between CUDA arrays and standard ones for broadcasting; is this no longer supported?
Sorry in advance if this is a dumb issue; I have yet to find a workaround.

Non cartesian gather!

Hi,

I have an array quantity a computed on every process, and I need to gather all the a on process 0.
Can I do this gathering operation with ParallelStencil.jl, or should I switch to plain MPI programming?

Thank you for your help.

dimensionality restrictions

I'm considering using this package for a PDE formulation where the grid is in a four-dimensional space. I'd therefore like to ask how hard the restriction to 1, 2, or 3 dimensions (at least according to the README) is in the code.

Passing `struct`s into `@parallel` functions

I have several @parallel functions that each have several arguments (of type Data.Array). To make the code cleaner, I (naively) tried passing a single struct containing all of these arguments. Unfortunately, I get the following error:

ERROR: LoadError: MethodError: no method matching var"@within"(::LineNumberNode, ::Module, ::String, ::Expr)
Closest candidates are:
  var"@within"(::LineNumberNode, ::Module, ::String, ::Symbol) at ~/.julia/packages/ParallelStencil/3flwf/src/FiniteDifferences.jl:153

Fundamentally, I'm assuming this has something to do with how the struct lives in memory and isn't compatible with the way ParallelStencil distributes Data.Arrays. Is there a way around this? Perhaps there's a more "canonical" way to accomplish this?
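For concreteness, a hypothetical sketch of the pattern that triggers this error (the struct and field names are made up; the point is that the finite-difference macros then receive an Expr like F.P instead of the Symbol they expect):

struct Fields
    P  :: Data.Array
    Vx :: Data.Array
end

@parallel function update!(F::Fields, dt::Data.Number, dx::Data.Number)
    @all(F.P) = @all(F.P) - dt*@d_xa(F.Vx)/dx   # `F.P` is an Expr, hence the @within MethodError
    return
end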

Could you also explain the role of the @within macro, so that I can better understand the limitations of @parallel functions (and how to properly use them)?

Thanks for the great package!

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Optimal Data Layout

Congrats and many thanks for your super impressive and useful package!

I have a few questions:

  • Are there docs/videos/forum threads explaining the internal design choices?

  • I was wondering about the internal data layout of fields: are they organized to adapt to the cache hierarchy and SIMD width of the computing targets? Do you think it would make sense?

  • The README states that time blocking should not be interesting for real applications. Could you elaborate? I thought that small blocks with halos could perform several time steps before communication and might help reduce memory bandwidth pressure. Do you think it would make sense?

Best,

Laurent

Test against cudnn?

I'm curious what GPU performance you get against something like the cudnn wrappers of NNlibCUDA.jl. Those would be more appropriate comparisons than CuArray broadcasting.

Debugging and profiling workflow

Is there a recommended workflow when trying to debug and profile @parallel_indices functions?

Often, my functions reference several other functions (which typically act as function barriers). If I want to profile these nested functions, or even the parent function, I run into various problems due to the way the macro extracts out code.

For example, it would be great if I could use something like Cthulhu.jl to recursively evaluate the @code_warntype output (and check for dynamic dispatching etc.) but this currently isn't possible. More importantly, none of the current Julia debuggers can traverse into any of the kernels. In other words, I can't seem to find a way to inspect the current running state of the kernel (which is also important for checking for dynamic dispatch and specialization errors).

I could copy the body contents into a "custom" function that implements a manual loop over the indices... but this seems a little too brute force.

CUDA Crash with julia 1.9.0

Hi,
I tried to run the acoustic3D.jl miniapp using CUDA (USE_GPU = true).
Everything is fine with Julia 1.8.5, but it crashes with Julia 1.9.

julia> include("acoustic3D.jl")
┌ Warning: ParallelStencil has already been initialized, with the same arguments. If you are using ParallelStencil interactively in the REPL, then you can ignore this message. If you are using ParallelStencil non-interactively, then you are likely using ParallelStencil in an inconsistent way: @init_parallel_stencil should only be called once, right after 'using ParallelStencil'.
└ @ ParallelStencil ~/.julia/packages/ParallelStencil/fQa5L/src/init_parallel_stencil.jl:73
┌ Warning: Module Data from previous module initialization found in caller module (Main); module Data not created. If you are working interactively in the REPL, then you can ignore this message.
└ @ ParallelStencil.ParallelKernel ~/.julia/packages/ParallelStencil/fQa5L/src/ParallelKernel/init_parallel_kernel.jl:33
ERROR: CUDA error: an illegal memory access was encountered (code 700, ERROR_ILLEGAL_ADDRESS)
Stacktrace:
  [1] throw_api_error(res::CUDA.cudaError_enum)
    @ CUDA ~/.julia/packages/CUDA/BbliS/lib/cudadrv/error.jl:89
  [2] macro expansion
    @ ~/.julia/packages/CUDA/BbliS/lib/cudadrv/error.jl:97 [inlined]
  [3] cuMemAllocAsync(dptr::Base.RefValue{CUDA.CuPtr{Nothing}}, bytesize::Int64, hStream::CUDA.CuStream)
    @ CUDA ~/.julia/packages/CUDA/BbliS/lib/utils/call.jl:26
  [4] #alloc#1

...

(test_pstencil) pkg> status
Status `~/temp/test_pstencil/Project.toml`
⌅ [052768ef] CUDA v3.13.1
  [94395366] ParallelStencil v0.6.1
  [91a5bcdd] Plots v1.38.11
Info Packages marked with ⌅ have new versions available but compatibility constraints restrict them from upgrading. To see why use `status --outdated`

julia> versioninfo()
Julia Version 1.9.0
Commit 8e630552924 (2023-05-07 11:25 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × 13th Gen Intel(R) Core(TM) i9-13900K
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, goldmont)
  Threads: 8 on 32 virtual cores
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 8
  JULIA_IMAGE_THREADS = 1

julia> using CUDA

julia> CUDA.versioninfo()
CUDA toolkit 11.7, artifact installation
NVIDIA driver 530.41.3, for CUDA 12.1
CUDA driver 12.1

Libraries: 
- CUBLAS: 11.10.1
- CURAND: 10.2.10
- CUFFT: 10.7.2
- CUSOLVER: 11.3.5
- CUSPARSE: 11.7.3
- CUPTI: 17.0.0
- NVML: 12.0.0+530.41.3
- CUDNN: 8.30.2 (for CUDA 11.5.0)
- CUTENSOR: 1.4.0 (for CUDA 11.5.0)

Toolchain:
- Julia: 1.9.0
- LLVM: 14.0.6
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2, 7.3, 7.4, 7.5
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

1 device:
  0: NVIDIA GeForce RTX 4070 (sm_89, 9.942 GiB / 11.994 GiB available)

julia> Unhandled Task ERROR: CUDA error: an illegal memory access was encountered (code 700, ERROR_ILLEGAL_ADDRESS)
Stacktrace:

Improving loop kernel speed

I have two versions of a relatively simple (2D) finite-difference code, one written in C++, the other using ParallelStencil. The C++ code is consistently 6x-8x faster than the Julia implementation (on a single thread of my M1 CPU). I'm using "cells/second" as my benchmarking metric. It's slightly different from the throughput metric (T) used in the PS publications, but still proportional to the same quantity and a little more meaningful in this context.

I have two theories as to why the Julia code is slower, but I'd like to get some feedback.

  1. Branching. First, my C++ implementation has a pretty tight loop, with no branching whatsoever. All of the branching is outside the loop (resulting in some significant code repetition). In my PS implementation, I'm using a @parallel_indices kernel that does have an if-statement to ensure the loop indices don't step out of bounds (I'm using a staggered grid, and not all arrays have the same size). I did some preliminary work to refactor the edges of each loop so that I could factor out the branching, but I haven't seen any significant speed improvements. Are there any other case statements hidden within the macros that I may not be aware of?

  2. SIMD support. My C++ variant has explicit SIMD optimizations using some compiler macros (see the illustrative sketch after this list). Does PS help the compiler out in this regard? I realize this is tricky with the branching mentioned earlier... but before I arduously refactor all my code, I want to make sure PS is even SIMD-aware.
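For context, the Julia-side analogue of such explicit hints in a hand-written kernel would look roughly like the following (illustrative only; this is not how ParallelStencil generates its loops):

# Hand-written 2D stencil update with explicit vectorization hints.
function update!(T2, T, a)
    @inbounds for j = 2:size(T, 2)-1
        @simd for i = 2:size(T, 1)-1
            T2[i, j] = T[i, j] + a*(T[i-1, j] + T[i+1, j] + T[i, j-1] + T[i, j+1] - 4T[i, j])
        end
    end
    return nothing
end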

Thanks for your help!

How to implement custom finite differencing operators

I am interested in using this package to multithread some hyperbolic PDEs, starting from the example of the 1D acoustic wave equation. The first step is to figure out how to implement my own finite difference operators (for example, I use summation-by-parts operators that are not diagonal).

The documentation states that "Custom macros to extend the finite differences submodules or for other stencil-based numerical methods can be readily plugged in"

It is not obvious to me how to do this. Say I have a non-diagonal but sparse matrix D that I want to apply to the current state vector, how can I write a macro to do this?

Julia 1.6 not supported due to outdated @within signature check

> julia test_parallel.jl 
ERROR: LoadError: LoadError: LoadError: MethodPluginError: the signature of the macro @within is not compatible with ParallelStencil (detected signature: "# 1 method for macro "@within":
[1] var"@within"(__source__::LineNumberNode, __module__::Module, macroname::String, A::Symbol) in ParallelStencil.FiniteDifferences3D at /users/omlins/.julia/dev/ParallelStencil/src/FiniteDifferences.jl:308"). The signature must correspond to the description in ParallelStencil.WITHIN_DOC. See in ParallelStencil.FiniteDifferences{1|2|3}D for examples.
Stacktrace:
  [1] check_mask_macro(caller::Module)
    @ ParallelStencil ~/.julia/dev/ParallelStencil/src/parallel.jl:89
  [2] parallel_kernel(caller::Module, package::Symbol, numbertype::DataType, ndims::Int64, kernel::Expr)
    @ ParallelStencil ~/.julia/dev/ParallelStencil/src/parallel.jl:76
  [3] parallel(caller::Module, args::Expr; package::Symbol)
    @ ParallelStencil ~/.julia/dev/ParallelStencil/src/parallel.jl:43
  [4] parallel(caller::Module, args::Expr)
    @ ParallelStencil ~/.julia/dev/ParallelStencil/src/parallel.jl:40
  [5] var"@parallel"(__source__::LineNumberNode, __module__::Module, args::Vararg{Any, N} where N)
    @ ParallelStencil ~/.julia/dev/ParallelStencil/src/parallel.jl:14
  [6] eval
    @ ./boot.jl:360 [inlined]
  [7] eval(x::Expr)
    @ Base.MainInclude ./client.jl:446
  [8] top-level scope
    @ ~/.julia/dev/ParallelStencil/test/test_parallel.jl:11
  [9] eval(m::Module, e::Any)
    @ Core ./boot.jl:360
 [10] var"@static"(__source__::LineNumberNode, __module__::Module, ex::Any)
    @ Base ./osutils.jl:19

@parallel_indices exception in package precompilation

Hey there!

Having a great time using this package, but I have started to include ParallelStencil in a package that I am writing, and that involves tagging some functions with @parallel_indices. During precompilation of an example I have written, it errors out, as I obviously haven't initialized a stencil yet. What's curious is that functions tagged with @parallel don't cause the same error.

Is there anything I need to declare/ have missed within my own package to stop this error?

Apologies if this is a daft question, I haven't got a lot of experience with GPU stuff!

Cheers,

Dan

Creating modules that depend on `ParallelStencil.jl`

Suppose I wanted to create a module that depends on ParallelStencil.jl (e.g. a self-contained finite-difference code). What's the best way to do this?

Normally this is rather straightforward... but the primary source of ambiguity for me is how to best handle the @init_parallel_stencil() macro.

Naively, I might just call that macro at the top of the script which uses the new module, but it gets hairy when you need to access the created Data module inside of your new module (there's a lot of redundant passing that occurs).

Perhaps a more elegant scenario is to call that macro within the module itself (but then you have a chicken-and-egg problem when trying to specify the parameters of that macro...)?
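For concreteness, a minimal sketch of that second option, with the initialization parameters hard-coded inside the module (module and function names are made up):

module MyFDCode

using ParallelStencil
using ParallelStencil.FiniteDifferences2D
@init_parallel_stencil(Threads, Float64, 2)   # parameters fixed at module definition time

@parallel function step!(A::Data.Array, B::Data.Array)
    @all(A) = @all(A) + @all(B)
    return
end

end # module

The chicken-and-egg aspect is then that the backend and number type can no longer be chosen by the code that uses the module.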

I'm curious to see how others tackle this.

Disable subnormals inside @parallel blocks

I'm able to disable subnormals (e.g. set_zero_subnormals(true)) within @parallel_indices blocks, but not @parallel blocks. Is there an easy way to get around this? Thanks!

Code generation for multiple stencils

Often, a particular stencil-based code can contain dozens or even hundreds of variations of the most "complete" stencil. For example, depending on what parameters the simulation is running, it may only need certain aspects of the most complete case.

In C/C++, many codes use some sort of code generation/ metaprogramming to handle the combinatorial explosion that occurs in these cases.

Has anyone explored @parallel kernel generation in this context? I tried creating a @generated function from a @parallel construct, but was running into some issues.

Just curious if others have thought of this. Thanks!

CUDA 3.0.2 triggers error in REPL

CUDA v3.0.2 fix of

  • Synchronize REPL expressions before returning (#837)

now produces an error within ParallelStencil when running code on the Base.Threads backend in the REPL while CUDA isn't functional (i.e. no libcuda is available).

This if statement addition in CUDA.jl (src/initialization.jl) seems to be responsible for this.

Note that ParallelStencil has CUDA as a dependency but won't use it unless it is activated in @init_parallel_stencil().

Error stack trace:

ERROR: CUDA.jl did not successfully initialize, and is not usable.
If you did not see any other error message, try again in a new session
with the JULIA_DEBUG environment variable set to 'CUDA'.
Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:33
  [2] libcuda()
    @ CUDA ~/.julia/packages/CUDA/MpASK/src/initialization.jl:52
  [3] macro expansion
    @ ~/.julia/packages/CUDA/MpASK/lib/cudadrv/libcuda.jl:29 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/CUDA/MpASK/lib/cudadrv/error.jl:94 [inlined]
  [5] cuDeviceGet
    @ ~/.julia/packages/CUDA/MpASK/lib/utils/call.jl:26 [inlined]
  [6] CuDevice
    @ ~/.julia/packages/CUDA/MpASK/lib/cudadrv/devices.jl:25 [inlined]
  [7] CUDA.TaskLocalState()
    @ CUDA ~/.julia/packages/CUDA/MpASK/src/state.jl:50
  [8] task_local_state!()
    @ CUDA ~/.julia/packages/CUDA/MpASK/src/state.jl:73
  [9] stream
    @ ~/.julia/packages/CUDA/MpASK/src/state.jl:419 [inlined]
 [10] synchronize()
    @ CUDA ~/.julia/packages/CUDA/MpASK/lib/cudadrv/stream.jl:117
 [11] top-level scope
    @ ~/.julia/packages/CUDA/MpASK/src/initialization.jl:83

Spatially varying parameters?

Hi Ludovic and Samuel,

Hope you stepped into the new year with ease. I have been experimenting with your Julia software for a while, thanks to your clear documentation and workflow!

I have a couple of questions about https://github.com/omlins/ParallelStencil.jl/blob/main/miniapps/HydroMech2D.jl
while I am using the software:

  1. How do we allow the background porosity distribution to be spatially varying? Right now the background porosity is set to be homogeneous and only accepts a scalar (since other physical parameters, e.g. the time step, also depend on it). The same goes for density, etc.
  2. When simulating with a larger background porosity (e.g. 0.2) and a larger perturbation (e.g. 0.4), the behavior does not look like chimneys but more like "blobs". Is this behavior expected? Also, do I have to change the physical time stepping and reduction time stepping accordingly?

Thanks a lot for your help!

Kind Regards,

Francis

GPUCompiler error when running 3D diffusion example on the GPU

While trying out the provided example, I receive the following error (Julia 1.8):

ERROR: MethodError: no method matching return_types(::GPUArrays.var"#5#6", ::Type{Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float64, 3, 1}, Float64}}, ::GPUCompiler.GPUInterpreter)
Closest candidates are:
  return_types(::Any, ::Any; world, interp) at reflection.jl:1294
  return_types(::Any) at reflection.jl:1294

Handling boundaries

For fun I was putting together some method of lines demos. Here's ParallelStencil.jl used with Runge-Kutta Chebyshev methods:

const USE_GPU = true
using ParallelStencil, OrdinaryDiffEq
using ParallelStencil.FiniteDifferences3D
@static if USE_GPU
    @init_parallel_stencil(CUDA, Float64, 3);
else
    @init_parallel_stencil(Threads, Float64, 3);
end

@parallel function diffusion3D_step!(T2, T, Ci, lam, dx, dy, dz)
    @inn(T2) = lam*@inn(Ci)*(@d2_xi(T)/dx^2 + @d2_yi(T)/dy^2 + @d2_zi(T)/dz^2);
    return
end

function diffusion3D(alg)
    # Physics
    lam        = 1.0;                                        # Thermal conductivity
    cp_min     = 1.0;                                        # Minimal heat capacity
    lx, ly, lz = 10.0, 10.0, 10.0;                           # Length of domain in dimensions x, y and z.

    # Numerics
    nx, ny, nz = 256, 256, 256;                              # Number of gridpoints dimensions x, y and z.
    nt         = 100;                                        # Number of time steps
    dx         = lx/(nx-1);                                  # Space step in x-dimension
    dy         = ly/(ny-1);                                  # Space step in y-dimension
    dz         = lz/(nz-1);                                  # Space step in z-dimension

    # Array initializations
    T   = @zeros(nx, ny, nz);
    T2  = @zeros(nx, ny, nz);
    Ci  = @zeros(nx, ny, nz);

    # Initial conditions (heat capacity and temperature with two Gaussian anomalies each)
    Ci .= 1.0./( cp_min .+ Data.Array([5*exp(-(((ix-1)*dx-lx/1.5))^2-(((iy-1)*dy-ly/2))^2-(((iz-1)*dz-lz/1.5))^2) +
                                       5*exp(-(((ix-1)*dx-lx/3.0))^2-(((iy-1)*dy-ly/2))^2-(((iz-1)*dz-lz/1.5))^2) for ix=1:size(T,1), iy=1:size(T,2), iz=1:size(T,3)]) )
    T  .= Data.Array([100*exp(-(((ix-1)*dx-lx/2)/2)^2-(((iy-1)*dy-ly/2)/2)^2-(((iz-1)*dz-lz/3.0)/2)^2) +
                       50*exp(-(((ix-1)*dx-lx/2)/2)^2-(((iy-1)*dy-ly/2)/2)^2-(((iz-1)*dz-lz/1.5)/2)^2) for ix=1:size(T,1), iy=1:size(T,2), iz=1:size(T,3)])
    T2 .= T;                                                 # Assign also T2 to get correct boundary conditions.
    function f(du,u,p,t)
        @show t
        @parallel diffusion3D_step!(du, u, Ci, lam, dx, dy, dz);
    end
    prob = ODEProblem(f, T, (0.0,1.0))
    sol = solve(prob, alg, save_everystep = false, save_start = false)
end

sol = diffusion3D(ROCK2())
using Plots
heatmap(Array(sol[end]))

However, I couldn't figure out the API for specifying BCs. Is there a way to specifically mention the boundary stencil or do you have to just write a loop after the interior calculation?

Question about @inn macro

I am relatively new to metaprogramming in Julia so this may simply be a personal issue.

With that acknowledged, I am struggling to understand how the @inn macro in FiniteDifferences.jl behaves as advertised. The documentation defines @doc "@inn(A): Select the inner elements of A. Corresponds to A[2:end-1]" :(@inn). The relevant definitions in the file include
const ix = INDICES[1]
const ixi = :($ix+1)
...
macro all(A::Symbol) esc(:( $A[$ix ] )) end
macro inn(A::Symbol) esc(:( $A[$ixi ] )) end
From this definition, it seems that @inn would simply shift all of the indices of A::Symbol by one value (resulting in an out of bounds error). Considering that all of the miniapps function as intended, I know I must be missing something but I can't for the life of me figure out how it works.

Thread (CPU) Float32/Float64 performance comparison on miniapp acoustic2D

Hi,

I am a bit surprised by the performance of the miniapp acoustic2D.jl when I switch from the original Float64 to Float32 precision.
I use these settings:

 # Numerics
    nx, ny    = 2047, 2047    # numerical grid resolution; should be a multiple of 32-1 for optimal GPU perf
    nt        = 1000        # number of timesteps
    nout      = 1000          # plotting frequency

On my Mac M1 Max I obtain (Julia 1.8.5 with 8 threads and bounds checking disabled):

Float64

Animation directory: ./viz2D_out/
Total steps=1000, time=4.920e+00 sec (@ T_eff = 40.00 GB/s) 
[ Info: Saved animation to /Users/laurentplagne/Projects/ParallelStencil.jl/miniapps/acoustic2D.gif

and with Float32

Animation directory: ./viz2D_out/
Total steps=1000, time=4.534e+00 sec (@ T_eff = 22.00 GB/s) 
[ Info: Saved animation to /Users/laurentplagne/Projects/ParallelStencil.jl/miniapps/acoustic2D.gif

So, basically no speed-up at all, although I assumed that this case was strongly memory bound and would have expected Float32 to run twice as fast as Float64.

Lastly, the comment for the T_eff computation seems to be outdated:

  A_eff    = (3*2)/1e9*nx*ny*sizeof(Data.Number)  # Effective main memory access per iteration [GB] (Lower bound of required memory access: H and dHdτ have to be read and written (dHdτ for damping): 4 whole-array memaccess; B has to be read: 1 whole-array memaccess)

Any hint?
P.S. Here is the slightly modified miniapp (with a variable FP):

const USE_GPU = false  # Use GPU? If this is set false, then no GPU needs to be available
using ParallelStencil
using ParallelStencil.FiniteDifferences2D

const FP = Float64

@static if USE_GPU
    @init_parallel_stencil(CUDA, FP, 2)
else
    @init_parallel_stencil(Threads, FP, 2)
end
using Plots, Printf, Statistics

@parallel function compute_V!(Vx::Data.Array, Vy::Data.Array, P::Data.Array, dt::Data.Number, ρ::Data.Number, dx::Data.Number, dy::Data.Number)
    @inn(Vx) = @inn(Vx) - dt/ρ*@d_xi(P)/dx
    @inn(Vy) = @inn(Vy) - dt/ρ*@d_yi(P)/dy
    return
end

@parallel function compute_P!(P::Data.Array, Vx::Data.Array, Vy::Data.Array, dt::Data.Number, k::Data.Number, dx::Data.Number, dy::Data.Number)
    @all(P) = @all(P) - dt*k*(@d_xa(Vx)/dx + @d_ya(Vy)/dy)
    return
end

##################################################
@views function acoustic2D()
    # Physics
    lx, ly    = 40.0, 40.0  # domain extends
    k         = 1.0         # bulk modulus
    ρ         = 1.0         # density
    t         = 0.0         # physical time
    # Numerics
    nx, ny    = 2047, 2047    # numerical grid resolution; should be a multiple of 32-1 for optimal GPU perf
    nt        = 1000        # number of timesteps
    nout      = 1000          # plotting frequency
    # Derived numerics
    dx, dy    = lx/(nx-1), ly/(ny-1) # cell sizes
    # Array allocations
    P         = @zeros(nx  ,ny  )
    Vx        = @zeros(nx+1,ny  )
    Vy        = @zeros(nx  ,ny+1)
    # Initial conditions
    P        .= Data.Array([exp(-((ix-1)*dx-0.5*lx)^2 -((iy-1)*dy-0.5*ly)^2) for ix=1:size(P,1), iy=1:size(P,2)])
    dt        = min(dx,dy)/sqrt(k/ρ)/4.1
    # Preparation of visualisation
    ENV["GKSwstype"]="nul"; if isdir("viz2D_out")==false mkdir("viz2D_out") end; loadpath = "./viz2D_out/"; anim = Animation(loadpath,String[])
    println("Animation directory: $(anim.dir)")
    X, Y      = -lx/2:dx:lx/2, -ly/2:dy:ly/2
    # Time loop
    for it = 1:nt
        if (it==11)  global wtime0 = Base.time()  end
        @parallel compute_V!(Vx, Vy, P, FP(dt), FP(ρ), FP(dx), FP(dy))
        @parallel compute_P!(P, Vx, Vy, FP(dt), FP(k), FP(dx), FP(dy))
        t = t + dt
        # Visualisation
        if mod(it,nout)==0
            heatmap(X, Y, Array(P)', aspect_ratio=1, xlims=(X[1],X[end]), ylims=(Y[1],Y[end]), c=:viridis, title="Pressure"); frame(anim)
        end
    end
    # Performance
    wtime    = Base.time()-wtime0
    A_eff    = (3*2)/1e9*nx*ny*sizeof(Data.Number)  # Effective main memory access per iteration [GB] (Lower bound of required memory access: H and dHdτ have to be read and written (dHdτ for damping): 4 whole-array memaccess; B has to be read: 1 whole-array memaccess)
    wtime_it = wtime/(nt-10)                        # Execution time per iteration [s]
    T_eff    = A_eff/wtime_it                       # Effective memory throughput [GB/s]
    @printf("Total steps=%d, time=%1.3e sec (@ T_eff = %1.2f GB/s) \n", nt, wtime, round(T_eff, sigdigits=2))
    gif(anim, "acoustic2D.gif", fps = 15)
    return
end

acoustic2D()
