omlins / ParallelStencil.jl

Package for writing high-level code for parallel high-performance stencil computations that can be deployed on both GPUs and CPUs

License: BSD 3-Clause "New" or "Revised" License

julia gpu stencil-codes parallel multi-gpu staggered-grids cuda multi-xpu xpu

parallelstencil.jl's Introduction

ParallelStencil.jl


ParallelStencil empowers domain scientists to write architecture-agnostic high-level code for parallel high-performance stencil computations on GPUs and CPUs. Performance similar to CUDA C / HIP can be achieved, which is typically a large improvement over the performance reached when using only CUDA.jl or AMDGPU.jl GPU Array programming. For example, a 2-D shallow ice solver presented at JuliaCon 2020 [1] achieved nearly 20 times better performance than a corresponding GPU Array programming implementation; in absolute terms, it reached 70% of the theoretical upper performance bound of the Nvidia P100 GPU used, as defined by the effective throughput metric, T_eff (note that T_eff is very different from common throughput metrics, see section Performance metric). The GPU performance of the solver is reported in green, the CPU performance in blue:

Performance ParallelStencil Teff

ParallelStencil relies on the native kernel programming capabilities of CUDA.jl and AMDGPU.jl and on Base.Threads for high-performance computations on GPUs and CPUs, respectively. It is seamlessly interoperable with ImplicitGlobalGrid.jl, which renders the distributed parallelization of stencil-based GPU and CPU applications on a regular staggered grid almost trivial and enables close to ideal weak scaling of real-world applications on thousands of GPUs [1, 2, 3, 4]. Moreover, ParallelStencil enables hiding communication behind computation with a simple macro call and without any particular restrictions on the package used for communication. ParallelStencil has been designed in conjunction with ImplicitGlobalGrid.jl for simplest possible usage by domain-scientists, rendering fast and interactive development of massively scalable high performance multi-GPU applications readily accessible to them. Furthermore, we have developed a self-contained approach for "Solving Nonlinear Multi-Physics on GPU Supercomputers with Julia" relying on ParallelStencil and ImplicitGlobalGrid.jl [1]. ParallelStencil's feature to hide communication behind computation was showcased when a close to ideal weak scaling was demonstrated for a 3-D poro-hydro-mechanical real-world application on up to 1024 GPUs on the Piz Daint Supercomputer [1]:

Parallel efficiency of ParallelStencil with CUDA C backend

A particularity of ParallelStencil is that it enables writing a single high-level Julia code that can be deployed on either a CPU or a GPU. In conjunction with ImplicitGlobalGrid.jl the same Julia code can even run on a single CPU thread or on thousands of GPUs/CPUs.

Beyond traditional high-performance computing, ParallelStencil supports automatic differentiation of architecture-agnostic parallel kernels relying on Enzyme.jl, enabling both high-level and generic syntax for maximal flexibility.

Contents

Parallelization and optimization with one macro call

A simple call to @parallel is enough to parallelize and optimize a function and to launch it. The package used underneath for parallelization is defined in a call to @init_parallel_stencil beforehand. CUDA.jl and AMDGPU.jl are supported for running on GPUs, and Base.Threads for running on CPUs. The following example outlines how to run parallel computations on a GPU using the native kernel programming capabilities of CUDA.jl underneath (omitted lines are represented with #(...), omitted arguments with ...):

#(...)
@init_parallel_stencil(CUDA,...)
#(...)
@parallel function diffusion3D_step!(...)
    #(...)
end
#(...)
@parallel diffusion3D_step!(...)

Automatic advanced fast memory usage optimization (of shared memory and registers) can be activated with the keyword argument memopt=true:

@parallel memopt=true function diffusion3D_step!(...)
    #(...)
end
#(...)
@parallel memopt=true diffusion3D_step!(...)

Note that arrays are automatically allocated on the hardware chosen for the computations (GPU or CPU) when using the provided allocation macros:

  • @zeros
  • @ones
  • @rand
  • @falses
  • @trues
  • @fill
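
For illustration, here is a minimal sketch of how these macros are typically used (the array sizes and variable names are placeholders; the macros mirror the Base functions of the same names and return arrays on the hardware selected in @init_parallel_stencil):

using ParallelStencil
@init_parallel_stencil(Threads, Float64, 3)

nx, ny, nz = 64, 64, 64
T = @zeros(nx, ny, nz)      # array of zeros, allocated on the device selected above
C = @ones(nx, ny, nz)       # array of ones
R = @rand(nx, ny, nz)       # array of uniformly distributed random numbers
M = @falses(nx, ny, nz)     # boolean array initialized to false
F = @fill(1.5, nx, ny, nz)  # array filled with the value 1.5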

Stencil computations with math-close notation

ParallelStencil provides submodules for computing finite differences in 1-D, 2-D and 3-D with a math-close notation (FiniteDifferences1D, FiniteDifferences2D and FiniteDifferences3D). Custom macros to extend the finite differences submodules or for other stencil-based numerical methods can be readily plugged in. The following example shows a complete function for computing a time step of a 3-D heat diffusion solver using FiniteDifferences3D.

#(...)
using ParallelStencil.FiniteDifferences3D
#(...)
@parallel function diffusion3D_step!(T2, T, Ci, lam, dt, dx, dy, dz)
    @inn(T2) = @inn(T) + dt*(lam*@inn(Ci)*(@d2_xi(T)/dx^2 + @d2_yi(T)/dy^2 + @d2_zi(T)/dz^2));
    return
end

The macros used in this example are described in the Module documentation callable from the Julia REPL / IJulia:

julia> using ParallelStencil.FiniteDifferences3D
julia> ?
help?> @inn
  @inn(A): Select the inner elements of A. Corresponds to A[2:end-1,2:end-1,2:end-1].

help?> @d2_xi
  @d2_xi(A): Compute the 2nd order differences between adjacent elements of A along the dimension x and select the inner elements of A in the remaining dimensions. Corresponds to @inn_yz(@d2_xa(A)).

Note that @d2_yi and @d2_zi perform the analogous operations along the dimensions y and z, respectively.

Type ?FiniteDifferences3D in the Julia REPL to explore all provided macros.
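
For illustration, a corresponding 2-D kernel using FiniteDifferences2D could look as follows (a sketch assuming a 2-D initialization of ParallelStencil; it is not part of the original example):

#(...)
using ParallelStencil.FiniteDifferences2D
#(...)
@parallel function diffusion2D_step!(T2, T, Ci, lam, dt, dx, dy)
    @inn(T2) = @inn(T) + dt*(lam*@inn(Ci)*(@d2_xi(T)/dx^2 + @d2_yi(T)/dy^2));
    return
end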

50-lines example deployable on GPU and CPU

This concise 3-D heat diffusion solver uses ParallelStencil, and a simple boolean USE_GPU defines whether it runs on GPU or CPU (the environment variable JULIA_NUM_THREADS defines how many cores are used in the latter case):

const USE_GPU = true
using ParallelStencil
using ParallelStencil.FiniteDifferences3D
@static if USE_GPU
    @init_parallel_stencil(CUDA, Float64, 3);
else
    @init_parallel_stencil(Threads, Float64, 3);
end

@parallel function diffusion3D_step!(T2, T, Ci, lam, dt, dx, dy, dz)
    @inn(T2) = @inn(T) + dt*(lam*@inn(Ci)*(@d2_xi(T)/dx^2 + @d2_yi(T)/dy^2 + @d2_zi(T)/dz^2));
    return
end

function diffusion3D()
# Physics
lam        = 1.0;                                        # Thermal conductivity
cp_min     = 1.0;                                        # Minimal heat capacity
lx, ly, lz = 10.0, 10.0, 10.0;                           # Length of domain in dimensions x, y and z.

# Numerics
nx, ny, nz = 256, 256, 256;                              # Number of gridpoints in dimensions x, y and z
nt         = 100;                                        # Number of time steps
dx         = lx/(nx-1);                                  # Space step in x-dimension
dy         = ly/(ny-1);                                  # Space step in y-dimension
dz         = lz/(nz-1);                                  # Space step in z-dimension

# Array initializations
T   = @zeros(nx, ny, nz);
T2  = @zeros(nx, ny, nz);
Ci  = @zeros(nx, ny, nz);

# Initial conditions (heat capacity and temperature with two Gaussian anomalies each)
Ci .= 1.0./( cp_min .+ Data.Array([5*exp(-(((ix-1)*dx-lx/1.5))^2-(((iy-1)*dy-ly/2))^2-(((iz-1)*dz-lz/1.5))^2) +
                                   5*exp(-(((ix-1)*dx-lx/3.0))^2-(((iy-1)*dy-ly/2))^2-(((iz-1)*dz-lz/1.5))^2) for ix=1:size(T,1), iy=1:size(T,2), iz=1:size(T,3)]) )
T  .= Data.Array([100*exp(-(((ix-1)*dx-lx/2)/2)^2-(((iy-1)*dy-ly/2)/2)^2-(((iz-1)*dz-lz/3.0)/2)^2) +
                   50*exp(-(((ix-1)*dx-lx/2)/2)^2-(((iy-1)*dy-ly/2)/2)^2-(((iz-1)*dz-lz/1.5)/2)^2) for ix=1:size(T,1), iy=1:size(T,2), iz=1:size(T,3)])
T2 .= T;                                                 # Assign also T2 to get correct boundary conditions.

# Time loop
dt = min(dx^2,dy^2,dz^2)*cp_min/lam/8.1;                 # Time step for the 3D Heat diffusion
for it = 1:nt
    @parallel diffusion3D_step!(T2, T, Ci, lam, dt, dx, dy, dz);
    T, T2 = T2, T;
end

end

diffusion3D()

The corresponding file can be found here.

50-lines multi-xPU example

This concise multi-xPU 3-D heat diffusion solver uses ParallelStencil in conjunction with ImplicitGlobalGrid.jl and can be readily deployed on a single CPU thread or on thousands of GPUs/CPUs. Again, a simple boolean USE_GPU defines whether it runs on GPU(s) or CPU(s) (JULIA_NUM_THREADS defines how many cores are used in the latter case). The solver can be run with any number of processes. ImplicitGlobalGrid.jl automatically creates an implicit global computational grid based on the number of processes the solver is run with (and based on the process topology, which can be explicitly chosen by the user or automatically determined). Please refer to the documentation of ImplicitGlobalGrid.jl for more information.

const USE_GPU = true
using ImplicitGlobalGrid
import MPI
using ParallelStencil
using ParallelStencil.FiniteDifferences3D
@static if USE_GPU
    @init_parallel_stencil(CUDA, Float64, 3);
else
    @init_parallel_stencil(Threads, Float64, 3);
end

@parallel function diffusion3D_step!(T2, T, Ci, lam, dt, dx, dy, dz)
    @inn(T2) = @inn(T) + dt*(lam*@inn(Ci)*(@d2_xi(T)/dx^2 + @d2_yi(T)/dy^2 + @d2_zi(T)/dz^2));
    return
end

function diffusion3D()
# Physics
lam        = 1.0;                                        # Thermal conductivity
cp_min     = 1.0;                                        # Minimal heat capacity
lx, ly, lz = 10.0, 10.0, 10.0;                           # Length of domain in dimensions x, y and z.

# Numerics
nx, ny, nz = 256, 256, 256;                              # Number of gridpoints in dimensions x, y and z
nt         = 100;                                        # Number of time steps
init_global_grid(nx, ny, nz);
dx         = lx/(nx_g()-1);                              # Space step in x-dimension
dy         = ly/(ny_g()-1);                              # Space step in y-dimension
dz         = lz/(nz_g()-1);                              # Space step in z-dimension

# Array initializations
T   = @zeros(nx, ny, nz);
T2  = @zeros(nx, ny, nz);
Ci  = @zeros(nx, ny, nz);

# Initial conditions (heat capacity and temperature with two Gaussian anomalies each)
Ci .= 1.0./( cp_min .+ Data.Array([5*exp(-((x_g(ix,dx,Ci)-lx/1.5))^2-((y_g(iy,dy,Ci)-ly/2))^2-((z_g(iz,dz,Ci)-lz/1.5))^2) +
                                   5*exp(-((x_g(ix,dx,Ci)-lx/3.0))^2-((y_g(iy,dy,Ci)-ly/2))^2-((z_g(iz,dz,Ci)-lz/1.5))^2) for ix=1:size(T,1), iy=1:size(T,2), iz=1:size(T,3)]) )
T  .= Data.Array([100*exp(-((x_g(ix,dx,T)-lx/2)/2)^2-((y_g(iy,dy,T)-ly/2)/2)^2-((z_g(iz,dz,T)-lz/3.0)/2)^2) +
                   50*exp(-((x_g(ix,dx,T)-lx/2)/2)^2-((y_g(iy,dy,T)-ly/2)/2)^2-((z_g(iz,dz,T)-lz/1.5)/2)^2) for ix=1:size(T,1), iy=1:size(T,2), iz=1:size(T,3)])
T2 .= T;                                                 # Assign also T2 to get correct boundary conditions.

# Time loop
dt = min(dx^2,dy^2,dz^2)*cp_min/lam/8.1;                 # Time step for the 3D Heat diffusion
for it = 1:nt
    @parallel diffusion3D_step!(T2, T, Ci, lam, dt, dx, dy, dz);
    update_halo!(T2);
    T, T2 = T2, T;
end

finalize_global_grid();
end

diffusion3D()

The corresponding file can be found here.

Thanks to ImplicitGlobalGrid.jl, only a few function calls had to be added in order to turn the previous single GPU/CPU solver into a multi-xPU solver (omitted unmodified lines are represented with #(...)):

#(...)
using ImplicitGlobalGrid
#(...)
function diffusion3D()
# Physics
#(...)
# Numerics
#(...)
init_global_grid(nx, ny, nz);
dx  = lx/(nx_g()-1);                                     # Space step in x-dimension
dy  = ly/(ny_g()-1);                                     # Space step in y-dimension
dz  = lz/(nz_g()-1);                                     # Space step in z-dimension

# Array initializations
#(...)

# Initial conditions (heat capacity and temperature with two Gaussian anomalies each)
Ci .= 1.0./( cp_min .+ Data.Array([5*exp(-((x_g(ix,dx,Ci)-lx/1.5))^2-((y_g(iy,dy,Ci)-ly/2))^2-((z_g(iz,dz,Ci)-lz/1.5))^2) +
                                   5*exp(-((x_g(ix,dx,Ci)-lx/3.0))^2-((y_g(iy,dy,Ci)-ly/2))^2-((z_g(iz,dz,Ci)-lz/1.5))^2) for ix=1:size(T,1), iy=1:size(T,2), iz=1:size(T,3)]) )
T  .= Data.Array([100*exp(-((x_g(ix,dx,T)-lx/2)/2)^2-((y_g(iy,dy,T)-ly/2)/2)^2-((z_g(iz,dz,T)-lz/3.0)/2)^2) +
                   50*exp(-((x_g(ix,dx,T)-lx/2)/2)^2-((y_g(iy,dy,T)-ly/2)/2)^2-((z_g(iz,dz,T)-lz/1.5)/2)^2) for ix=1:size(T,1), iy=1:size(T,2), iz=1:size(T,3)])

# Time loop
#(...)
for it = 1:nt
    #(...)
    update_halo!(T2);
    #(...)
end

finalize_global_grid();
end

diffusion3D()

Here is the resulting movie when running the application on 8 GPUs, solving 3-D heat diffusion with heterogeneous heat capacity (two Gaussian anomalies) on a global computational grid of 510x510x510 grid points. It shows the x-z plane in the middle of the y-dimension:

3-D heat diffusion

The corresponding file can be found here.

Seamless interoperability with communication packages and hiding communication

The previous multi-xPU example shows that ParallelStencil is seamlessly interoperable with ImplicitGlobalGrid.jl. The same is a priori true for any communication package that allows one to decide explicitly when the required communication occurs; an example is MPI.jl (besides, MPI.jl is also seamlessly interoperable with ImplicitGlobalGrid.jl and can extend its functionality).

Moreover, ParallelStencil enables hiding communication behind computation with a simple call to @hide_communication. In the following example, the communication performed by update_halo! (from the package ImplicitGlobalGrid.jl) is hidden behind the computations done by @parallel diffusion3D_step!:

@hide_communication (16, 2, 2) begin
    @parallel diffusion3D_step!(T2, T, Ci, lam, dt, dx, dy, dz);
    update_halo!(T2);
end

This enables close to ideal weak scaling of real-world applications on thousands of GPUs/CPUs [1, 2]. Type ?@hide_communication in the Julia REPL to obtain an explanation of the arguments. Profiling a 3-D viscous Stokes flow application using the Nvidia visual profiler (nvvp) graphically exemplifies how the update-velocities kernel is split up into boundary and inner-domain computations, and how the latter overlap with the point-to-point MPI communication for halo exchange:

Communication computation overlap ParallelStencil

Support for architecture-agnostic low level kernel programming

High-level stencil computations with math-close notation can be complemented with index-based programming using the macro @parallel_indices. For example, the following kernel enables setting Neumann boundary conditions in dimension y (∂A/∂y = 0) when using finite differences in 3-D:

@parallel_indices (ix,iz) function bc_y!(A)
    A[ix,   1, iz] = A[ix,     2, iz]
    A[ix, end, iz] = A[ix, end-1, iz]
    return
end

It can be launched as follows:

@parallel (1:size(A,1), 1:size(A,3)) bc_y!(A)
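
Analogously, a kernel setting Neumann boundary conditions in dimension x (∂A/∂x = 0) can be written and launched over the remaining index ranges (a sketch following the pattern above; it is not part of the original README):

@parallel_indices (iy,iz) function bc_x!(A)
    A[1,   iy, iz] = A[2,     iy, iz]
    A[end, iy, iz] = A[end-1, iy, iz]
    return
end
#(...)
@parallel (1:size(A,2), 1:size(A,3)) bc_x!(A)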

Furthermore, a set of architecture-agnostic low level kernel language constructs is supported in these @parallel_indices kernels (see in Module documentation callable from the Julia REPL / IJulia). They enable, e.g., explicit usage of shared memory (see this 2-D heat diffusion example).

Support for logical arrays of small arrays / structs

Logical arrays of small arrays / structs enabling optimized data access can be conveniently created with the architecture-agnostic allocation macros introduced earlier (see Parallelization and optimization with one macro call). For this purpose, ParallelStencil leverages CellArrays (from CellArrays.jl, which relies in turn on StaticArrays.jl). To create a logical array of small arrays, it is sufficient to pass to any of these allocation macros the keyword celldims with the dimensions of the inner arrays, e.g.:

nx, ny, nz = 128, 128, 128
celldims   = (4, 4)
A = @rand(nx, ny, nz, celldims=celldims)
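
The keyword celldims works likewise with the other allocation macros introduced earlier; a brief sketch reusing the sizes defined above (the variable names and cell dimensions are placeholders):

B = @zeros(nx, ny, nz, celldims=(3,))          # each cell holds a 3-component vector
C = @fill(0.5, nx, ny, nz, celldims=celldims)  # each 4x4 cell is filled with 0.5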

To create a logical array of structs, a custom cell type can be defined using the macro @CellType and then passed to any of the allocation macros via the keyword celltype, e.g.:

@CellType SymmetricTensor3D fieldnames=(xx, yy, zz, yz, xz, xy)
#(...)
A = @zeros(nx, ny, nz, celltype=SymmetricTensor3D)

Details are found in the Module documentation callable from the Julia REPL / IJulia.

Support for automatic differentiation of architecture-agnostic parallel kernels

Enzyme.jl-powered automatic differentiation of architecture-agnostic parallel kernels is fully integrated into ParallelStencil's API and exposed with both high-level and generic syntax. The keyword argument ∇ is enough to trigger a parallel call to the gradient kernel instead of the kernel itself, e.g.:

#(...)
@parallel_indices (ix) function f!(A, B, a)
    A[ix] += a * B[ix] * 100.65
    return
end

N = 16
a = 6.5
A = @rand(N)
B = @rand(N)
Ā = @ones(N)
B̄ = @ones(N)

@parallel f!(A, B, a)                 # normal call of f!
@parallel ∇=(A->Ā, B->B̄) f!(A, B, a)  # call to the gradient of f!, differentiated with respect to A and B

Keyword arguments to @parallel allow customizing the automatic differentiation (type ?@parallel to read the corresponding documentation). Automatic differentiation is supported for both @parallel and @parallel_indices kernels.

The generic syntax, which is enabled by the submodule ParallelStencil.AD, provides maximal flexibility. For the above example it looks as follows:

import ParallelStencil.AD
using Enzyme
#(...)
@parallel f!(A, B, a)                                                                                                                 # normal call of f!
@parallel configcall=f!(A, B, a) AD.autodiff_deferred!(Enzyme.Reverse, f!, DuplicatedNoNeed(A, Ā), DuplicatedNoNeed(B, B̄), Const(a))  # call to the gradient of f!, differentiated with respect to A and B

The submodule ParallelStencil.AD contains GPU-compatible wrappers of Enzyme functions (always returning nothing, as required by the backend packages CUDA.jl and AMDGPU.jl); the wrapper AD.autodiff_deferred! maps, e.g., to Enzyme.autodiff_deferred. The keyword argument configcall makes it trivial to call these generic functions for automatic differentiation with the right launch parameters.

Module documentation callable from the Julia REPL / IJulia

The module documentation can be called from the Julia REPL or in IJulia:

julia> using ParallelStencil
julia> ?
help?> ParallelStencil
search: ParallelStencil @init_parallel_stencil

  Module ParallelStencil

  Enables domain scientists to write high-level code for parallel high-performance stencil computations that can be deployed on both GPUs and CPUs.

  General overview and examples
  ≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡

  https://github.com/omlins/ParallelStencil.jl

  Primary macros
  ≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡

    •  @init_parallel_stencil

    •  @parallel

    •  @hide_communication

    •  @zeros

    •  @ones

    •  @rand

    •  @falses

    •  @trues

    •  @fill

  │ Advanced
  │
  │    •  @parallel_indices
  │
  │    •  @parallel_async
  │
  │    •  @synchronize

  Macros available for @parallel_indices kernels
  ≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡

    •  @ps_show

    •  @ps_println

  │ Advanced
  │
  │    •  @gridDim
  │
  │    •  @blockIdx
  │
  │    •  @blockDim
  │
  │    •  @threadIdx
  │
  │    •  @sync_threads
  │
  │    •  @sharedMem

  Submodules
  ≡≡≡≡≡≡≡≡≡≡≡≡

    •  ParallelStencil.FiniteDifferences1D

    •  ParallelStencil.FiniteDifferences2D

    •  ParallelStencil.FiniteDifferences3D

  Modules generated in caller
  ≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡

    •  Data

  To see a description of a macro or module type ?<macroname> (including the @) or ?<modulename>, respectively.

Concise single/multi-xPU miniapps

The miniapps are a collection of various 2-D and 3-D codes that leverage ParallelStencil to implement architecture-agnostic high-level code for parallel high-performance stencil computations on GPUs and CPUs. The miniapps target various challenging domain-science case studies where multi-physics coupling and important nonlinearities challenge existing solving strategies. In most cases, second order pseudo-transient relaxation delivers implicit solutions of the differential equations. Some miniapps feature a multi-xPU version, which combines the capabilities of ParallelStencil and ImplicitGlobalGrid.jl in order to enable multi-GPU and multi-CPU executions, unleashing Julia-at-scale.

Performance metric

Many-core processors such as GPUs are throughput-oriented systems that use their massive parallelism to hide latency. On the scientific application side, most algorithms require only a few operations or flops compared to the amount of numbers or bytes accessed from main memory, and are thus significantly memory bound. The Flop/s metric is therefore no longer the most adequate for reporting the performance of many modern applications. This status motivated us to develop a memory throughput-based performance evaluation metric, T_eff, to evaluate the performance of iterative stencil-based solvers [1].

The effective memory access, A_eff [GB], is the sum of twice the memory footprint of the unknown fields, D_u (fields that depend on their own history and need to be updated every iteration), and the memory footprint of the known fields, D_k, that do not change every iteration. The effective memory access divided by the execution time per iteration, t_it [sec], defines the effective memory throughput, T_eff [GB/s].

Effective memory throughput metric: T_eff = A_eff / t_it, with A_eff = 2 D_u + D_k

The upper bound of T_eff is T_peak as measured, e.g., by [7] for CPUs or a GPU analogue. Defining the T_eff metric, we assume that 1) we evaluate an iterative stencil-based solver, 2) the problem size is much larger than the cache sizes and 3) the usage of time blocking is not feasible or advantageous (which is a reasonable assumption for real-world applications). An important concept is not to include fields within the effective memory access that do not depend on their own history (e.g. fluxes); such fields can be re-computed on the fly or stored on-chip. Defining a theoretical upper bound for T_eff that is closer to the real upper bound is work in progress.
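
As a worked illustration of this definition (a sketch; the field roles are assumed from the 3-D heat diffusion example above and t_it is a placeholder for a measured execution time), T is the only unknown field (it depends on its own history) and Ci is a known field:

nx = ny = nz = 256                  # grid resolution of the 3-D heat diffusion example
D_u   = nx*ny*nz*sizeof(Float64)    # memory footprint of the unknown fields (T) [B]
D_k   = nx*ny*nz*sizeof(Float64)    # memory footprint of the known fields (Ci) [B]
A_eff = (2*D_u + D_k)/1e9           # effective memory access per iteration [GB]
t_it  = 1.0e-3                      # measured execution time per iteration [s] (placeholder)
T_eff = A_eff/t_it                  # effective memory throughput [GB/s]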

Using simple array broadcasting capabilities with GPU or CPU arrays within Julia does not enable close to optimal performance; ParallelStencil.jl permits overcoming this limitation (see the figure at the top) at a similar ease of programming.
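
To illustrate the difference in programming model, the following sketch contrasts a broadcast-style update with an equivalent ParallelStencil kernel for one 1-D diffusion step (the array names, sizes and parameters are placeholders; the kernel version is the one that typically reaches a much higher T_eff):

using ParallelStencil
using ParallelStencil.FiniteDifferences1D
@init_parallel_stencil(Threads, Float64, 1)

nx  = 1024
dx  = 1.0/(nx-1); dt = dx^2/2.1; lam = 1.0
T   = @rand(nx)
T2  = copy(T)

# Array (broadcast) programming version of one diffusion step:
T2[2:end-1] .= T[2:end-1] .+ dt .* lam .* (T[3:end] .- 2 .* T[2:end-1] .+ T[1:end-2]) ./ dx^2

# Equivalent ParallelStencil kernel:
@parallel function diffusion1D_step!(T2, T, lam, dt, dx)
    @inn(T2) = @inn(T) + dt*lam*@d2(T)/dx^2
    return
end
@parallel diffusion1D_step!(T2, T, lam, dt, dx)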

Miniapp content

All miniapp codes follow a similar structure and permit serial and threaded CPU as well as Nvidia GPU execution. The first line of each miniapp code enables the CUDA GPU backend when the USE_GPU flag is set to true.

All the miniapps can be interactively executed within the Julia REPL (this includes the multi-xPU versions when using a single CPU or GPU). Note that for optimal performance the miniapp script of interest <miniapp_code> should be launched from the shell using the project's dependencies --project, disabling array bound checking --check-bounds=no, and using optimization level 3 -O3.

$ julia --project --check-bounds=no -O3 <miniapp_code>.jl
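
The multi-xPU miniapps are additionally launched through MPI, e.g. with the mpiexecjl wrapper provided by MPI.jl (a sketch; the process count and script name are placeholders):

$ mpiexecjl -n 8 julia --project --check-bounds=no -O3 <miniapp_code>_multixpu.jl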

Note: refer to the documentation of your supercomputing centre for instructions on running Julia at scale. Instructions for running on the Piz Daint GPU supercomputer at the Swiss National Supercomputing Centre can be found here, and instructions for running on the octopus GPU supercomputer at the Swiss Geocomputing Centre can be found here.

Thermo-mechanical convection 2-D app

app: ThermalConvection2D.jl

This thermal convection example in 2-D combines viscous Stokes flow with advection-diffusion of heat, including a temperature-dependent shear viscosity. The miniapp resolves thermal convection cells (e.g. Earth's mantle convection and plumes):

thermal convection 2-D

The gif depicts the non-dimensional temperature field evolving into convection cells and plumes. Results are obtained by running the miniapp ThermalConvection2D.jl.

Viscous Stokes 2-D app

app: Stokes2D.jl

The viscous Stokes flow example solves the incompressible Stokes equations with linear shear rheology in 2-D. The model configuration represents a buoyant inclusion within a less buoyant matrix:

viscous Stokes 2-D

The figure depicts - Upper panels: the dynamical pressure field and the vertical velocity Vy. Lower panels: the log10 of the vertical momentum balance residual Ry and the log10 of the error norm evolution as a function of pseudo-transient iterations. Results are obtained by running the miniapp Stokes2D.jl.

Viscous Stokes 3-D app

apps: Stokes3D.jl, Stokes3D_multixpu.jl

The viscous Stokes flow example solves the incompressible Stokes equations with linear shear rheology in 3-D. The model configuration represents a buoyant spherical inclusion within a less buoyant matrix:

viscous Stokes 3-D multi-xPU

The figure depicts vertically sliced cross-sections of - Upper panels: the dynamical pressure field and the vertical velocity Vz. Lower panels: the log10 of the vertical momentum balance residual Rz and the log10 of the error norm evolution as a function of pseudo-transient iterations. The numerical resolution is 252x252x252 grid points in 3-D on 8 GPUs (i.e. a local domain size of 127x127x127 per GPU). Stokes3D.jl and Stokes3D_multixpu.jl are single- and multi-xPU implementations, respectively. The multi-xPU implementation demonstrates ParallelStencil's capability to hide communication behind computation; the communication is performed with ImplicitGlobalGrid.jl in this case. Results are obtained by running the multi-xPU miniapp Stokes3D_multixpu.jl on 8 Nvidia Titan Xm GPUs distributed across two physically distinct compute nodes.

This multi-xPU application permits leveraging distributed-memory parallelisation to enable large-scale 3-D calculations.

Acoustic wave 2-D app

app: acoustic2D.jl

The acoustic wave example solves acoustic waves in 2-D using the split velocity-pressure formulation:

Acoustic wave 2-D

The animation depicts the dynamical pressure field evolution as a function of explicit time steps. Results are obtained by running the miniapp acoustic2D.jl.

Acoustic wave 3-D app

apps: acoustic3D.jl, acoustic3D_multixpu.jl

The acoustic wave example solves acoustic waves in 3-D using the split velocity-pressure formulation:

Acoustic wave 3-D multi-xPU

The animation depicts the y-slice of the dynamical pressure field evolution as a function of explicit time steps. The achieved numerical resolution is 1020x1020x1020 grid points in 3-D on 8 GPUs (i.e. a local domain size of 511x511x511 per GPU). acoustic3D.jl and acoustic3D_multixpu.jl are single- and multi-xPU implementations, respectively. The multi-xPU implementation demonstrates ParallelStencil's capability to hide communication behind computation; the communication is performed with ImplicitGlobalGrid.jl in this case. Results are obtained by running the multi-xPU miniapp acoustic3D_multixpu.jl on 8 Nvidia Titan Xm GPUs distributed across two physically distinct compute nodes.

This multi-xPU application permits leveraging distributed-memory parallelisation to enable large-scale 3-D calculations.

Scalar porosity waves 2-D app

app: scalar_porowaves2D.jl

The scalar porosity waves example solves the scalar solitary wave equations in 2-D assuming total pressure to be lithostatic, thus eliminating the need to solve for the total pressure explicitly:

scalar porosity waves 2-D - blobs scalar porosity waves 2-D - channels

The animation depicts the evolution of the normalised porosity and effective pressure fields as a function of explicit time steps. Top row: compaction and decompaction rheology are identical, resulting in circular solitary waves rearranging into solitons of characteristic size. Bottom row: decompaction occurs at a much faster rate than compaction, resulting in chimney-shaped features.

Hydro-mechanical porosity waves 2-D app

app: HydroMech2D.jl

The hydro-mechanical porosity wave example resolves solitary waves in 2-D owing to hydro-mechanical coupling, removing the lithostatic pressure assumption. The total and fluid pressure are explicitly computed from nonlinear Stokes and Darcy flow solvers, respectively [8]:

Hydro-mechanical porosity waves 2-D

The animation depicts the formation of fluid escape pipes in two-phase media owing to decompaction weakening, obtained by running the miniapp HydroMech2D.jl. Top row: evolution of the porosity distribution and effective pressure. Bottom row: Darcy flux (relative fluid to solid motion) and solid (porous matrix) deformation.

Dependencies

ParallelStencil relies on the Julia packages CUDA.jl [5, 6], AMDGPU.jl, MacroTools.jl, CellArrays.jl and StaticArrays.jl.

Installation

ParallelStencil may be installed directly with the Julia package manager from the REPL:

julia>]
  pkg> add ParallelStencil
  pkg> test ParallelStencil

Questions, comments and discussions

To discuss technical issues, please post on Julia Discourse in the GPU topic or the Julia at Scale topic, or in the #gpu or #distributed channels on the Julia Slack (to join, visit https://julialang.org/slack/). To discuss numerical/domain-science issues, please post on Julia Discourse in the Numerics topic, the Modelling & Simulations topic, or whichever other topic best fits your issue.

References

[1] Omlin, S., Räss, L., Kwasniewski, G., Malvoisin, B., & Podladchikov, Y. Y. (2020). Solving Nonlinear Multi-Physics on GPU Supercomputers with Julia. JuliaCon Conference, virtual.

[2] Räss, L., Reuber, G., Omlin, S. (2020). Multi-Physics 3-D Inversion on GPU Supercomputers with Julia. JuliaCon Conference, virtual.

[3] Räss, L., Omlin, S., & Podladchikov, Y. Y. (2019). Porting a Massively Parallel Multi-GPU Application to Julia: a 3-D Nonlinear Multi-Physics Flow Solver. JuliaCon Conference, Baltimore, USA.

[4] Räss, L., Omlin, S., & Podladchikov, Y. Y. (2019). A Nonlinear Multi-Physics 3-D Solver: From CUDA C + MPI to Julia. PASC19 Conference, Zurich, Switzerland.

[5] Besard, T., Foket, C., & De Sutter, B. (2018). Effective Extensible Programming: Unleashing Julia on GPUs. IEEE Transactions on Parallel and Distributed Systems, 30(4), 827-841. doi: 10.1109/TPDS.2018.2872064

[6] Besard, T., Churavy, V., Edelman, A., & De Sutter B. (2019). Rapid software prototyping for heterogeneous and distributed platforms. Advances in Engineering Software, 132, 29-46. doi: 10.1016/j.advengsoft.2019.02.002

[7] McCalpin, J. D. (1995). Memory Bandwidth and Machine Balance in Current High Performance Computers. IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, December 1995.

[8] Räss, L., Duretz, T., & Podladchikov, Y. Y. (2019). Resolving hydromechanical coupling in two and three dimensions: spontaneous channelling of porous fluids owing to de-compaction weakening. Geophysical Journal International, ggz239.

parallelstencil.jl's People

Contributors

albert-de-montserrat, luraess, omlins

parallelstencil.jl's Issues

Spatially varying parameters?

Hi Ludovic and Samuel,

Hope you stepped into the new year with ease. I have been experimenting with your Julia software for a while, thanks to your clear documentation and workflow!

I have a couple of questions about https://github.com/omlins/ParallelStencil.jl/blob/main/miniapps/HydroMech2D.jl
while I am using the software:

  1. How do we allow the background porosity distribution to be spatially varying? Right now the background porosity is set to be homogeneous and only accepts a scalar (since other physical parameters, e.g. the time step, also depend on it). Same for density, etc.
  2. When simulating with larger background porosity (e.g. 0.2) and larger perturbation (e.g. 0.4), the behavior does not look like chimneys but more like "blobs". Is this behavior expected? Also, do I have to change the physical time stepping and reduction time stepping accordingly?

Thanks a lot for your help!

Kind Regards,

Francis

Support for Complex arrays

Fantastic module.

Will there be (or is there already, somewhere hidden?) support for complex-typed arrays? I would love to use this to perform simulations on the Nonlinear Schrodinger Equation (specifically Gross-Pitaevskii), but that isn't possible without this feature.

sync issues on AMDGPU backend

Running large kernels on the AMDGPU backend seems to experience issues with syncing. The behaviour is similar to #107 (comment).

The issue is that the AMDGPU.HIPStream constructor is used instead of getting the default stream for the current task via AMDGPU.stream().

Passing `struct`s into `@parallel` functions

I have several @parallel functions that each have several arguments (of type Data.Array). To make the code cleaner, I (naively) tried passing a single struct containing all of these arguments. Unfortunately, I get the following error:

ERROR: LoadError: MethodError: no method matching var"@within"(::LineNumberNode, ::Module, ::String, ::Expr)
Closest candidates are:
  var"@within"(::LineNumberNode, ::Module, ::String, ::Symbol) at ~/.julia/packages/ParallelStencil/3flwf/src/FiniteDifferences.jl:153

Fundamentally, I'm assuming this has something to do with how the struct lives in memory and isn't compatible with the way ParallelStencil distributes Data.Arrays. Is there a way around this? Perhaps there's a more "canonical" way to accomplish this?

Could you also explain the role of the @within macro, so that I can better understand the limitations of @parallel functions (and how to properly use them)?

Thanks for the great package!

Improving loop kernel speed

I have two versions of a relatively simple (2D) finite-difference code, one written in C++, the other using ParallelStencil. The C++ code is consistently 6x-8x faster than the Julia implementation (on a single thread of my M1 CPU). I'm using "cells/second" as my benchmarking metric. It's slightly different from the throughput metric (T) used in the PS publications, but still proportional to the same quantity and a little more meaningful in this context.

I have two theories as to why the Julia code is slower, but I'd like to get some feedback.

  1. branching. First, my c++ implementation has a pretty tight loop. No branching whatsoever. All of the branching is outside the loop (resulting in some significant code repetition). In my PS implementation, I'm using a @parallel_indices kernel that does have an if-statement to ensure the loop indices don't step out of bounds (I'm using a staggered grid, and not all arrays have the same size). I did some preliminary work to refactor the edges of each loop so that I could factor out the branching, but I haven't seen any significant speed improvements. Are there any other case statements hidden within the macros I may not be aware of?

  2. SIMD support. My c++ variant has explicit simd optimizations using some compiler macros. Does PS help the compiler out in this regard? I realize this is tricky with the branching I had earlier... but before I arduously refactor all my code, I want to make sure PS is even simd aware.

Thanks for your help!

Julia 1.6 not supported due to outdated @within signature check

> julia test_parallel.jl 
ERROR: LoadError: LoadError: LoadError: MethodPluginError: the signature of the macro @within is not compatible with ParallelStencil (detected signature: "# 1 method for macro "@within":
[1] var"@within"(__source__::LineNumberNode, __module__::Module, macroname::String, A::Symbol) in ParallelStencil.FiniteDifferences3D at /users/omlins/.julia/dev/ParallelStencil/src/FiniteDifferences.jl:308"). The signature must correspond to the description in ParallelStencil.WITHIN_DOC. See in ParallelStencil.FiniteDifferences{1|2|3}D for examples.
Stacktrace:
  [1] check_mask_macro(caller::Module)
    @ ParallelStencil ~/.julia/dev/ParallelStencil/src/parallel.jl:89
  [2] parallel_kernel(caller::Module, package::Symbol, numbertype::DataType, ndims::Int64, kernel::Expr)
    @ ParallelStencil ~/.julia/dev/ParallelStencil/src/parallel.jl:76
  [3] parallel(caller::Module, args::Expr; package::Symbol)
    @ ParallelStencil ~/.julia/dev/ParallelStencil/src/parallel.jl:43
  [4] parallel(caller::Module, args::Expr)
    @ ParallelStencil ~/.julia/dev/ParallelStencil/src/parallel.jl:40
  [5] var"@parallel"(__source__::LineNumberNode, __module__::Module, args::Vararg{Any, N} where N)
    @ ParallelStencil ~/.julia/dev/ParallelStencil/src/parallel.jl:14
  [6] eval
    @ ./boot.jl:360 [inlined]
  [7] eval(x::Expr)
    @ Base.MainInclude ./client.jl:446
  [8] top-level scope
    @ ~/.julia/dev/ParallelStencil/test/test_parallel.jl:11
  [9] eval(m::Module, e::Any)
    @ Core ./boot.jl:360
 [10] var"@static"(__source__::LineNumberNode, __module__::Module, ex::Any)
    @ Base ./osutils.jl:19

Thread (CPU) Float32/Float64 performance comparison on miniapp acoustic2D

Hi,

I am a bit surprised by the performance comparison of miniapp acoustic2D.jl when I switch from the original Float64 to Float32 FP precision.
I use these settings:

 # Numerics
    nx, ny    = 2047, 2047    # numerical grid resolution; should be a multiple of 32-1 for optimal GPU perf
    nt        = 1000        # number of timesteps
    nout      = 1000          # plotting frequency

On my mac M1 max I obtain (julia 1.8.5 with 8 threads and no boundcheck)

Float64

Animation directory: ./viz2D_out/
Total steps=1000, time=4.920e+00 sec (@ T_eff = 40.00 GB/s) 
[ Info: Saved animation to /Users/laurentplagne/Projects/ParallelStencil.jl/miniapps/acoustic2D.gif

and with Float32

Animation directory: ./viz2D_out/
Total steps=1000, time=4.534e+00 sec (@ T_eff = 22.00 GB/s) 
[ Info: Saved animation to /Users/laurentplagne/Projects/ParallelStencil.jl/miniapps/acoustic2D.gif

So, basically no speed-up at all, although I assumed that this case was strongly memory bound and would have expected Float32 to run twice as fast as Float64.

Last, the comment for the Teff computation seems to be deprecated

  A_eff    = (3*2)/1e9*nx*ny*sizeof(Data.Number)  # Effective main memory access per iteration [GB] (Lower bound of required memory access: H and dHdτ have to be read and written (dHdτ for damping): 4 whole-array memaccess; B has to be read: 1 whole-array memaccess)

Any hint ?
P.S. Here is the slightly modified miniapp (with a variable FP)

const USE_GPU = false  # Use GPU? If this is set false, then no GPU needs to be available
using ParallelStencil
using ParallelStencil.FiniteDifferences2D

const FP = Float64

@static if USE_GPU
    @init_parallel_stencil(CUDA, FP, 2)
else
    @init_parallel_stencil(Threads, FP, 2)
end
using Plots, Printf, Statistics

@parallel function compute_V!(Vx::Data.Array, Vy::Data.Array, P::Data.Array, dt::Data.Number, ρ::Data.Number, dx::Data.Number, dy::Data.Number)
    @inn(Vx) = @inn(Vx) - dt/ρ*@d_xi(P)/dx
    @inn(Vy) = @inn(Vy) - dt/ρ*@d_yi(P)/dy
    return
end

@parallel function compute_P!(P::Data.Array, Vx::Data.Array, Vy::Data.Array, dt::Data.Number, k::Data.Number, dx::Data.Number, dy::Data.Number)
    @all(P) = @all(P) - dt*k*(@d_xa(Vx)/dx + @d_ya(Vy)/dy)
    return
end

##################################################
@views function acoustic2D()
    # Physics
    lx, ly    = 40.0, 40.0  # domain extends
    k         = 1.0         # bulk modulus
    ρ         = 1.0         # density
    t         = 0.0         # physical time
    # Numerics
    nx, ny    = 2047, 2047    # numerical grid resolution; should be a multiple of 32-1 for optimal GPU perf
    nt        = 1000        # number of timesteps
    nout      = 1000          # plotting frequency
    # Derived numerics
    dx, dy    = lx/(nx-1), ly/(ny-1) # cell sizes
    # Array allocations
    P         = @zeros(nx  ,ny  )
    Vx        = @zeros(nx+1,ny  )
    Vy        = @zeros(nx  ,ny+1)
    # Initial conditions
    P        .= Data.Array([exp(-((ix-1)*dx-0.5*lx)^2 -((iy-1)*dy-0.5*ly)^2) for ix=1:size(P,1), iy=1:size(P,2)])
    dt        = min(dx,dy)/sqrt(k/ρ)/4.1
    # Preparation of visualisation
    ENV["GKSwstype"]="nul"; if isdir("viz2D_out")==false mkdir("viz2D_out") end; loadpath = "./viz2D_out/"; anim = Animation(loadpath,String[])
    println("Animation directory: $(anim.dir)")
    X, Y      = -lx/2:dx:lx/2, -ly/2:dy:ly/2
    # Time loop
    for it = 1:nt
        if (it==11)  global wtime0 = Base.time()  end
        @parallel compute_V!(Vx, Vy, P, FP(dt), FP(ρ), FP(dx), FP(dy))
        @parallel compute_P!(P, Vx, Vy, FP(dt), FP(k), FP(dx), FP(dy))
        t = t + dt
        # Visualisation
        if mod(it,nout)==0
            heatmap(X, Y, Array(P)', aspect_ratio=1, xlims=(X[1],X[end]), ylims=(Y[1],Y[end]), c=:viridis, title="Pressure"); frame(anim)
        end
    end
    # Performance
    wtime    = Base.time()-wtime0
    A_eff    = (3*2)/1e9*nx*ny*sizeof(Data.Number)  # Effective main memory access per iteration [GB] (Lower bound of required memory access: H and dHdτ have to be read and written (dHdτ for damping): 4 whole-array memaccess; B has to be read: 1 whole-array memaccess)
    wtime_it = wtime/(nt-10)                        # Execution time per iteration [s]
    T_eff    = A_eff/wtime_it                       # Effective memory throughput [GB/s]
    @printf("Total steps=%d, time=%1.3e sec (@ T_eff = %1.2f GB/s) \n", nt, wtime, round(T_eff, sigdigits=2))
    gif(anim, "acoustic2D.gif", fps = 15)
    return
end

acoustic2D()

GPUCompiler error when running 3D diffusion example on the GPU

While trying out the provided example, I receive the following error (Julia 1.8):

ERROR: MethodError: no method matching return_types(::GPUArrays.var"#5#6", ::Type{Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float64, 3, 1}, Float64}}, ::GPUCompiler.GPUInterpreter)
Closest candidates are:
  return_types(::Any, ::Any; world, interp) at reflection.jl:1294
  return_types(::Any) at reflection.jl:1294

Make CellArrays mutable

I found the documentation on the @CellType macro to be quite interesting and potentially useful for me.

For example, if I am trying to solve a system of tensor equations, then having this ability to store the elements of a symmetric tensor on the grid like the given example @CellType SymmetricTensor2D fieldnames=(xx, zz, xz) would be useful.

But I found that (as far as I can tell from the docs and my testing) there is no way to make the cell type mutable, which is necessary when the tensor makes up the state vector of my system of equations, and thus must be modified at each time step. Since this ultimately comes from StaticArrays.jl, there are mutable types available in there, so this functionality could be added.

Also, it would be nice to have the CellType act as the tensor it is meant to represent, while still only storing the needed values. For example, defining @CellType SymmetricTensor2D fieldnames=(xx, zz, xz) I could then overload getindex

@inline function Base.getindex(A::SymmetricTensor2D,μ::Int64,ν::Int64)
    (μ,ν) == (1,1) && return A.xx
    (μ,ν) == (2,2) && return A.zz
    (μ,ν) == (1,2) && return A.xz
    (μ,ν) == (2,1) && return A.xz
end

I don't know if this is the best way to go about this, although it does work. It might be nice to be able to define an index mapping in @CellType so that general tensors can index however they are needed to.

OOM error when running with more than one mpi rank

Hello,

Thank you for a great package! I am trying one of your examples and it runs fine with one mpi process (never runs out of memory). But fails on an OOM error with more mpi processes (I checked, the devices are less than 1% memory occupied).

With one process, success:

$JULIA_DEPOT_PATH/bin/mpiexecjl -n 1 julia --project=@. ./ParallelStencil/miwpY/miniapps/acoustic_waves_multixpu/acoustic3D_multixpu.jl
Global grid: 127x127x127 (nprocs: 1, dims: 1x1x1)
Animation directory: ./viz3D_out/
Total steps=1000, time=1.636e+01 sec (@ T_eff = 7.90 GB/s)

With 2 processes, failure:

$JULIA_DEPOT_PATH/bin/mpiexecjl -n 2 julia --project=@. ./ParallelStencil/miwpY/miniapps/acoustic_waves_multixpu/acoustic3D_multixpu.jl
Global grid: 252x127x127 (nprocs: 2, dims: 2x1x1)
ERROR: LoadError: Out of GPU memory trying to allocate 15.628 MiB
Effective GPU memory usage: 1.03% (416.000 MiB/39.586 GiB)
CUDA allocator usage: 0 bytes
Memory pool usage: 0 bytes (0 bytes allocated, 0 bytes cached)
Stacktrace:
 [1] #alloc#202
   @./CUDA/mVgLI/src/pool.jl:266 [inlined]
 [2] alloc
   @ ./CUDA/mVgLI/src/pool.jl:258 [inlined]
 [3] CUDA.CuArray{Float64, 3}(#unused#::UndefInitializer, dims::Tuple{Int64, Int64, Int64})
   @ ./CUDA/mVgLI/src/array.jl:28
 [4] CuArray
   @ ./CUDA/mVgLI/src/array.jl:109 [inlined]
 [5] CuArray
   @./CUDA/mVgLI/src/array.jl:110 [inlined]
 [6] zeros
   @ ./CUDA/mVgLI/src/array.jl:409 [inlined]
 [7] acoustic3D()
   @ Main ./ParallelStencil/miwpY/miniapps/acoustic_waves_multixpu/acoustic3D_multixpu.jl:40
 [8] top-level scope
   @ ./ParallelStencil/miwpY/miniapps/acoustic_waves_multixpu/acoustic3D_multixpu.jl:86

I appreciate any hints you may have about why this is happening.

Cheers!

Handling boundaries

For fun I was putting together some method of lines demos. Here's ParallelStencil.jl used with Runge-Kutta Chebyshev methods:

const USE_GPU = true
using ParallelStencil, OrdinaryDiffEq
using ParallelStencil.FiniteDifferences3D
@static if USE_GPU
    @init_parallel_stencil(CUDA, Float64, 3);
else
    @init_parallel_stencil(Threads, Float64, 3);
end

@parallel function diffusion3D_step!(T2, T, Ci, lam, dx, dy, dz)
    @inn(T2) = lam*@inn(Ci)*(@d2_xi(T)/dx^2 + @d2_yi(T)/dy^2 + @d2_zi(T)/dz^2);
    return
end

function diffusion3D(alg)
    # Physics
    lam        = 1.0;                                        # Thermal conductivity
    cp_min     = 1.0;                                        # Minimal heat capacity
    lx, ly, lz = 10.0, 10.0, 10.0;                           # Length of domain in dimensions x, y and z.

    # Numerics
    nx, ny, nz = 256, 256, 256;                              # Number of gridpoints dimensions x, y and z.
    nt         = 100;                                        # Number of time steps
    dx         = lx/(nx-1);                                  # Space step in x-dimension
    dy         = ly/(ny-1);                                  # Space step in y-dimension
    dz         = lz/(nz-1);                                  # Space step in z-dimension

    # Array initializations
    T   = @zeros(nx, ny, nz);
    T2  = @zeros(nx, ny, nz);
    Ci  = @zeros(nx, ny, nz);

    # Initial conditions (heat capacity and temperature with two Gaussian anomalies each)
    Ci .= 1.0./( cp_min .+ Data.Array([5*exp(-(((ix-1)*dx-lx/1.5))^2-(((iy-1)*dy-ly/2))^2-(((iz-1)*dz-lz/1.5))^2) +
                                       5*exp(-(((ix-1)*dx-lx/3.0))^2-(((iy-1)*dy-ly/2))^2-(((iz-1)*dz-lz/1.5))^2) for ix=1:size(T,1), iy=1:size(T,2), iz=1:size(T,3)]) )
    T  .= Data.Array([100*exp(-(((ix-1)*dx-lx/2)/2)^2-(((iy-1)*dy-ly/2)/2)^2-(((iz-1)*dz-lz/3.0)/2)^2) +
                       50*exp(-(((ix-1)*dx-lx/2)/2)^2-(((iy-1)*dy-ly/2)/2)^2-(((iz-1)*dz-lz/1.5)/2)^2) for ix=1:size(T,1), iy=1:size(T,2), iz=1:size(T,3)])
    T2 .= T;                                                 # Assign also T2 to get correct boundary conditions.
    function f(du,u,p,t)
        @show t
        @parallel diffusion3D_step!(du, u, Ci, lam, dx, dy, dz);
    end
    prob = ODEProblem(f, T, (0.0,1.0))
    sol = solve(prob, alg, save_everystep = false, save_start = false)
end

sol = diffusion3D(ROCK2())
using Plots
heatmap(Array(sol[end]))

However, I couldn't figure out the API for specifying BCs. Is there a way to specifically mention the boundary stencil or do you have to just write a loop after the interior calculation?

Error copying Cuda Array to Array when using GPU

Hi!
First I'll thank you all for this amazing module that I just discovered and enjoy so much.
I am very new to Julia and started to play around with some of the easy GPU computing that ParallelStencil.jl offers, but came across an error when running one of the provided examples (examples/diffusion3D_multigpucpu_hidecomm.jl):

    T_nohalo .= T[2:end-1,2:end-1,2:end-1];                                           # Copy data to CPU removing the halo.

where T would be a CUDA Array (selected by parallel_stencil) and T_nohalo a "standard" Array

Causing the following Error :

ERROR: LoadError: This object is not a GPU array
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:33
[2] backend(#unused#::Type)
@ GPUArrays ~/.julia/packages/GPUArrays/VNhDf/src/device/execution.jl:15
[3] backend(x::Array{Float64, 3})
@ GPUArrays ~/.julia/packages/GPUArrays/VNhDf/src/device/execution.jl:16
[4] _copyto!
@ ~/.julia/packages/GPUArrays/VNhDf/src/host/broadcast.jl:73 [inlined]
[5] materialize!
@ ~/.julia/packages/GPUArrays/VNhDf/src/host/broadcast.jl:51 [inlined]
[6] materialize!(dest::Array{Float64, 3}, bc::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{3}, Nothing, typeof(identity), Tuple{CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}}})
@ Base.Broadcast ./broadcast.jl:868
[7] diffusion3D()
@ Main /.../diffusion3D_multigpucpu_hidecomm.j:60
[8] top-level scope
@ /.../diffusion3D_multigpucpu_hidecomm.jl:84
in expression starting at /.../diffusion3D_multigpucpu_hidecomm.j:84

So from my basic understanding it would appear that parallel_stencil doesn't allow interoperability between CUDA Arrays and Standard ones for broadcasting, is it no longer supported?
Sorry in advance if this is a dumb issue, I have yet to find a workaround.

CUDA 3.0.2 triggers error in REPL

CUDA v3.0.2 fix of

  • Synchronize REPL expressions before returning (#837)

produces now an error within ParallelStencil when running code on the Base.Threads backend in the REPL when CUDA isn't functional (i.e. no libcuda is available).

This if statement addition in CUDA.jl (src/initialization.jl) seems to be responsible for this.

Note that ParallelStencil has CUDA as dependency but won't use it unless activated in @init_parallel_stencil()

Error stack trace:

ERROR: CUDA.jl did not successfully initialize, and is not usable.
If you did not see any other error message, try again in a new session
with the JULIA_DEBUG environment variable set to 'CUDA'.
Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:33
  [2] libcuda()
    @ CUDA ~/.julia/packages/CUDA/MpASK/src/initialization.jl:52
  [3] macro expansion
    @ ~/.julia/packages/CUDA/MpASK/lib/cudadrv/libcuda.jl:29 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/CUDA/MpASK/lib/cudadrv/error.jl:94 [inlined]
  [5] cuDeviceGet
    @ ~/.julia/packages/CUDA/MpASK/lib/utils/call.jl:26 [inlined]
  [6] CuDevice
    @ ~/.julia/packages/CUDA/MpASK/lib/cudadrv/devices.jl:25 [inlined]
  [7] CUDA.TaskLocalState()
    @ CUDA ~/.julia/packages/CUDA/MpASK/src/state.jl:50
  [8] task_local_state!()
    @ CUDA ~/.julia/packages/CUDA/MpASK/src/state.jl:73
  [9] stream
    @ ~/.julia/packages/CUDA/MpASK/src/state.jl:419 [inlined]
 [10] synchronize()
    @ CUDA ~/.julia/packages/CUDA/MpASK/lib/cudadrv/stream.jl:117
 [11] top-level scope
    @ ~/.julia/packages/CUDA/MpASK/src/initialization.jl:83

Caching the GPU kernels?

After starting Julia, the first usage / compilation of the GPU kernels takes about 30s for me. This is done every time I restart Julia.
Is there a possibility that this is cached?
Since Julia 1.9, it seems to me that most compilation is cached; it would be great if this were possible for this package as well.

`@hide_communication` error when inside a module and using `CUDA.jl` backend

Currently, the @hide_communication macro does not work when using it within a module, e.g. in the context of a package that depends on ParallelStencil (as done for example in AcousticWaveCPML.jl).

Error description

The following error pops up during precompilation of the package that depends on ParallelStencil:

...
ERROR: LoadError: Evaluation into the closed module `ParallelKernel` breaks incremental compilation because the side effects will not be permanent. This is likely due to some other module mutating `ParallelKernel` with `eval` during precompilation - don't do this.
Stacktrace:
  [1] eval
    @ ./boot.jl:368 [inlined]
  [2] eval
    @ ~/.julia/packages/ParallelStencil/3flwf/src/ParallelKernel/ParallelKernel.jl:37 [inlined]
  [3] extract_args(call::Expr, macroname::Symbol)
    @ ParallelStencil.ParallelKernel ~/.julia/packages/ParallelStencil/3flwf/src/ParallelKernel/shared.jl:100
  [4] hide_communication_cuda(boundary_width::Symbol, block::Expr)
    @ ParallelStencil.ParallelKernel ~/.julia/packages/ParallelStencil/3flwf/src/ParallelKernel/hide_communication.jl:118
  [5] hide_communication(::Symbol, ::Vararg{Union{Expr, Integer, Symbol}}; package::Symbol)
    @ ParallelStencil.ParallelKernel ~/.julia/packages/ParallelStencil/3flwf/src/ParallelKernel/hide_communication.jl:82
  [6] hide_communication(::Symbol, ::Vararg{Union{Expr, Integer, Symbol}})
    @ ParallelStencil.ParallelKernel ~/.julia/packages/ParallelStencil/3flwf/src/ParallelKernel/hide_communication.jl:81
  [7] var"@hide_communication"(__source__::LineNumberNode, __module__::Module, args::Vararg{Any})
    @ ParallelStencil.ParallelKernel ~/.julia/packages/ParallelStencil/3flwf/src/ParallelKernel/hide_communication.jl:68
...

This error only shows up when using @hide_communication inside a module where ParallelStencil was initialized with the CUDA.jl backend.

The cause of this error is the eval call inside the extract_args function that gets called inside the hide_communication_cuda function, currently implemented as:

extract_args(call::Expr, macroname::Symbol) = eval(substitute(deepcopy(call), macroname, Symbol("@get_args")))

This function is only called in the context of hide communication and the macroname is always Symbol("@parallel").

What this function does is replace the @parallel macro of the kernel call expression with the @get_args macro which is defined in the same file as extract_args like the following:

macro get_args(args...) return args end

and then calls eval to evaluate the new macro, hence getting a tuple with the arguments of the @parallel kernel call.
This eval call is problematic since it could change the ParallelStencil.ParallelKernel module from a third-party module while precompilation is occurring. Of course, no actual harm is done here, but Julia is (apparently?) not smart enough to notice this.

How to reproduce the error

Either:

  • clone AcousticWaveCPML.jl. Edit a multi-xPU shared solver (e.g. src/shared/acoustic_2D_multixPU.jl) and uncomment the @hide_communication call (e.g. line 160 and 172 in src/shared/acoustic_2D_multixPU.jl). Try to precompile the package and you should get the error.
  • write a package with a module that uses ParallelStencil inside of it and initialize ParallelStencil with the CUDA.jl backend. Write a @parallel kernel function and use @hide_communication to call that kernel function from within the module. Try to precompile the package and you should get the error.

Possible fixes

A possible fix is to call eval specifying the caller module to be the module where the @hide_communication macro is originally in. This does, however, require the @get_args macro to be accessible in the caller module, which it is not in this case because it resides in the ParallelStencil.ParallelKernel module.

Another fix is to just replace the extract_args function with the following:

extract_args(call::Expr) = tuple(call.args[3:end]...)

since we know that the only possible expression in call will be a @parallel kernel call, we can extract the arguments of the call just like that without having to rely on eval. Maybe it would be nice to extract these arguments somewhere else (e.g. in the extract_calls function) where it is explicitly checked that the expression is a @parallel kernel call.

This latter fix has been tested on both Base.Threads and CUDA.jl backends unit tests currently available in the package.

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Heterogeneous stencil formulations

TL;DR: can ParallelStencil run multiple distinct kernels on different SMs within a hide_communication block?

One of the biggest advantages of finite-difference-based codes is the ability to add a wide variety of "physics" into the domain by spatially modifying the underlying stencil. For example, different materials may contain additional PDE terms (i.e. additional storage requirements) that are only needed in certain regions of the grid.

Computationally, this means we can define various kernels across the grid that handle the local effects.

Of course, we could just opt to use the most general kernel across the whole grid, but this significantly reduces the potential performance of the code.

Naively, we might just write a series of case statements within our global looping kernel to handle the heterogeneous effects. However, extensive branching will severely hinder the performance of the code. Furthermore, it's difficult to handle local memory effects; in other words, we don't want to define large (and computationally dense) arrays whose data is actually quite sparse.

Instead, it's common practice to "chunk" the domain, such that each chunk uses the same stencil (i.e. the same underlying kernel). This way, arrays can be assigned to each chunk rather than to the global domain.
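To make the chunking idea concrete, here is a minimal sketch (the kernel names, the "physics" and the 50/50 domain split are all illustrative assumptions, not a ParallelStencil recipe) of two distinct kernels, each launched over an explicit index range so that it only touches its own chunk:

using ParallelStencil
using ParallelStencil.FiniteDifferences2D
@init_parallel_stencil(Threads, Float64, 2)

@parallel_indices (ix, iy) function update_fluid!(P, V)
    P[ix, iy] -= 0.1 * V[ix, iy]
    return
end

@parallel_indices (ix, iy) function update_solid!(P, V, E)
    P[ix, iy] -= 0.1 * V[ix, iy] - 0.01 * E[ix, iy]
    return
end

nx, ny = 128, 128
P = @zeros(nx, ny); V = @rand(nx, ny); E = @rand(nx, ny)

@parallel (1:nx÷2, 1:ny)    update_fluid!(P, V)       # "fluid" chunk: left half of the grid
@parallel (nx÷2+1:nx, 1:ny) update_solid!(P, V, E)    # "solid" chunk: right half of the grid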

In addition, GPUs can process different kernels simultaneously (e.g. using multiple SMs). Currently, ParallelStencil uses this approach to handle the hide_communication feature; in some cases it even updates the boundaries of the domain within the same timeframe (before a sync is performed).

In the ETHZ course, there is a small section describing a current limitation of this approach (screenshot omitted here).

Is it possible to apply other, more complicated kernels here?

typo in error message

@IncoherentCallError("ParallelStencil has already been initialized, with different arguments. If you are using ParallelStencil interactively in the REPL and want to avoid restarting Julia, then you can call ParalllelStencil.@reset_parallel_stencil() and rerun all parts of your code that use ParallelStencil features (including kernel definitions and array allocations). If you are using ParallelStencil non-interactively, then you are using ParallelStencil in an invalid way: @init_parallel_stencil should only be called once, right after 'using ParalllelStencil'.")

ParalllelStencil.@reset_parallel_stencil() should be replaced with ParallelStencil.@reset_parallel_stencil() (ParallelStencil is misspelled with a triple "l").

Creating modules that depend on `ParallelStencil.jl`

Suppose I wanted to create a module that depends on ParallelStencil.jl (e.g. a self-contained finite-difference code). What's the best way to do this?

Normally this is rather straightforward... but the primary source of ambiguity for me is how to best handle the @init_parallel_stencil() macro.

Naively, I might just call that macro at the top of the script which uses the new module, but it gets hairy when you need to access the created Data module inside your new module (there's a lot of redundant passing involved).

Perhaps a more elegant approach is to call that macro within the module itself (but then you have a chicken-and-egg problem when trying to specify the parameters of that macro...)?

I'm curious to see how others tackle this.
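For what it's worth, one possible pattern (a sketch of an assumption, not an official recommendation; the module name is made up and the kernel is just the standard 3-D diffusion step) is to initialize ParallelStencil inside the module itself and fix the backend with a compile-time flag, so that the generated Data module and the kernels live in the same place and nothing needs to be passed around:

module MyDiffusion                       # hypothetical module name

const USE_GPU = false                    # flip to true to use the CUDA.jl backend

using ParallelStencil
using ParallelStencil.FiniteDifferences3D
@static if USE_GPU
    @init_parallel_stencil(CUDA, Float64, 3)
else
    @init_parallel_stencil(Threads, Float64, 3)
end

@parallel function step!(T2, T, Ci, lam, dx, dy, dz)
    @inn(T2) = lam*@inn(Ci)*(@d2_xi(T)/dx^2 + @d2_yi(T)/dy^2 + @d2_zi(T)/dz^2)
    return
end

end # module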

Issues with backpropagation and adjoint construction

I couldn't get either method for adjoint generation working over the @parallel stencil. For pure Zygote-based VJP calculations, a kernel such as

@parallel function diffusion3D_step!(T2, T, Ci, lam, dx, dy, dz)
    @inn(T2) = lam*@inn(Ci)*(@d2_xi(T)/dx^2 + @d2_yi(T)/dy^2 + @d2_zi(T)/dz^2);
    return
end

will fail because of the mutation. I was wondering if you were planning to support adjoint rules on @parallel constructs for this; that would be required for GPU usage.

To avoid that issue, I tried ReverseDiff-based VJPs, which ran into the problem highlighted by the following MWE:

const USE_GPU = false
using ParallelStencil, OrdinaryDiffEq
using ParallelStencil.FiniteDifferences3D
@static if USE_GPU
    @init_parallel_stencil(CUDA, Float64, 3);
else
    @init_parallel_stencil(Threads, Float64, 3);
end

@parallel function diffusion3D_step!(T2, T, Ci, lam, dx, dy, dz)
    @inn(T2) = lam*@inn(Ci)*(@d2_xi(T)/dx^2 + @d2_yi(T)/dy^2 + @d2_zi(T)/dz^2);
    return
end

function diffusion3D(lam,alg)
    # Physics
    cp_min     = 1.0;                                        # Minimal heat capacity
    lx, ly, lz = 10.0, 10.0, 10.0;                           # Length of domain in dimensions x, y and z.

    # Numerics
    nx, ny, nz = 16, 16, 16;                              # Number of gridpoints dimensions x, y and z.
    nt         = 100;                                        # Number of time steps
    dx         = lx/(nx-1);                                  # Space step in x-dimension
    dy         = ly/(ny-1);                                  # Space step in y-dimension
    dz         = lz/(nz-1);                                  # Space step in z-dimension

    # Array initializations
    T   = @zeros(nx, ny, nz);
    T2  = @zeros(nx, ny, nz);
    Ci  = @zeros(nx, ny, nz);

    # Initial conditions (heat capacity and temperature with two Gaussian anomalies each)
    Ci .= 1.0./( cp_min .+ Data.Array([5*exp(-(((ix-1)*dx-lx/1.5))^2-(((iy-1)*dy-ly/2))^2-(((iz-1)*dz-lz/1.5))^2) +
                                       5*exp(-(((ix-1)*dx-lx/3.0))^2-(((iy-1)*dy-ly/2))^2-(((iz-1)*dz-lz/1.5))^2) for ix=1:size(T,1), iy=1:size(T,2), iz=1:size(T,3)]) )
    T  .= Data.Array([100*exp(-(((ix-1)*dx-lx/2)/2)^2-(((iy-1)*dy-ly/2)/2)^2-(((iz-1)*dz-lz/3.0)/2)^2) +
                       50*exp(-(((ix-1)*dx-lx/2)/2)^2-(((iy-1)*dy-ly/2)/2)^2-(((iz-1)*dz-lz/1.5)/2)^2) for ix=1:size(T,1), iy=1:size(T,2), iz=1:size(T,3)])
    T2 .= T;                                                 # Assign also T2 to get correct boundary conditions.

    dt = min(dx^2,dy^2,dz^2)*cp_min/8.1;                 # Time step for the 3D Heat diffusion
    function f(du,u,p,t)
        @show t
        @parallel diffusion3D_step!(du, u, Ci, p[1], dx, dy, dz);
    end
    prob = ODEProblem(f, T, (0.0,nt*dt), lam)
    sol = solve(prob, alg, save_everystep = false, save_start = false)
end

sol = diffusion3D([1.0],ROCK2())

using ForwardDiff, Zygote, DiffEqSensitivity
function loss(p)
    sum(diffusion3D(p,ROCK2()))
end
ForwardDiff.gradient(loss,[1.0])
Zygote.gradient(loss,[1.0])
TaskFailedException

    nested task error: BoundsError: attempt to access 22080-element Vector{ReverseDiff.AbstractInstruction} at index [474]
    Stacktrace:
      [1] setindex!
        @ .\array.jl:839 [inlined]
      [2] push!
        @ .\array.jl:930 [inlined]
      [3] record!
        @ C:\Users\accou\.julia\packages\ReverseDiff\E4Tzn\src\tape.jl:10 [inlined]
      [4] ForwardOptimize
        @ C:\Users\accou\.julia\packages\ReverseDiff\E4Tzn\src\macros.jl:105 [inlined]
      [5] -
        @ C:\Users\accou\.julia\packages\ReverseDiff\E4Tzn\src\derivatives\scalars.jl:9 [inlined]
      [6] macro expansion
        @ C:\Users\accou\.julia\packages\ParallelStencil\0VULM\src\parallel.jl:103 [inlined]
      [7] macro expansion
        @ D:\OneDrive\Computer\Desktop\test.jl:11 [inlined]
      [8] macro expansion
        @ C:\Users\accou\.julia\packages\ParallelStencil\0VULM\src\ParallelKernel\parallel.jl:330 [inlined]
      [9] (::var"#567#threadsfor_fun#7"{Array{ReverseDiff.TrackedReal{Float64, Float64, ReverseDiff.TrackedArray{Float64, Float64, 3, Array{Float64, 3}, Array{Float64, 3}}}, 3}, ReverseDiff.TrackedArray{Float64, Float64, 3, Array{Float64, 3}, Array{Float64, 3}}, Array{Float64, 3}, ReverseDiff.TrackedReal{Float64, Float64, ReverseDiff.TrackedArray{Float64, Float64, 1, Vector{Float64}, Vector{Float64}}}, Float64, Float64, Float64, Tuple{UnitRange{Int64}, UnitRange{Int64}, UnitRange{Int64}}, UnitRange{Int64}})(onethread::Bool)
        @ Main .\threadingconstructs.jl:81
     [10] (::var"#567#threadsfor_fun#7"{Array{ReverseDiff.TrackedReal{Float64, Float64, ReverseDiff.TrackedArray{Float64, Float64, 3, Array{Float64, 3}, Array{Float64, 3}}}, 3}, ReverseDiff.TrackedArray{Float64, Float64, 3, Array{Float64, 3}, Array{Float64, 3}}, Array{Float64, 3}, ReverseDiff.TrackedReal{Float64, Float64, ReverseDiff.TrackedArray{Float64, Float64, 1, Vector{Float64}, Vector{Float64}}}, Float64, Float64, Float64, Tuple{UnitRange{Int64}, UnitRange{Int64}, UnitRange{Int64}}, UnitRange{Int64}})()
        @ Main .\threadingconstructs.jl:48
wait at task.jl:322 [inlined]
threading_run(func::Function) at threadingconstructs.jl:34
macro expansion at threadingconstructs.jl:93 [inlined]
macro expansion at parallel.jl:328 [inlined]
diffusion3D_step! at shared.jl:77 [inlined]
(::var"#f#12"{Array{Float64, 3}, Float64, Float64, Float64})(du::Array{ReverseDiff.TrackedReal{Float64, Float64, ReverseDiff.TrackedArray{Float64, Float64, 3, Array{Float64, 3}, Array{Float64, 3}}}, 3}, u::ReverseDiff.TrackedArray{Float64, Float64, 3, Array{Float64, 3}, Array{Float64, 3}}, p::ReverseDiff.TrackedArray{Float64, Float64, 1, Vector{Float64}, Vector{Float64}}, t::ReverseDiff.TrackedReal{Float64, Float64, ReverseDiff.TrackedArray{Float64, Float64, 1, Vector{Float64}, Vector{Float64}}}) at test.jl:42
ODEFunction at scimlfunctions.jl:334 [inlined]
(::DiffEqSensitivity.var"#33#37"{ODEFunction{true, var"#f#12"{Array{Float64, 3}, Float64, Float64, Float64}, LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}})(u::ReverseDiff.TrackedArray{Float64, Float64, 3, Array{Float64, 3}, Array{Float64, 3}}, p::ReverseDiff.TrackedArray{Float64, Float64, 1, Vector{Float64}, Vector{Float64}}, t::ReverseDiff.TrackedArray{Float64, Float64, 1, Vector{Float64}, Vector{Float64}}) at derivative_wrappers.jl:345
ReverseDiff.GradientTape(f::Function, input::Tuple{Array{Float64, 3}, Vector{Float64}, Vector{Float64}}, cfg::ReverseDiff.GradientConfig{Tuple{ReverseDiff.TrackedArray{Float64, Float64, 3, Array{Float64, 3}, Array{Float64, 3}}, ReverseDiff.TrackedArray{Float64, Float64, 1, Vector{Float64}, Vector{Float64}}, ReverseDiff.TrackedArray{Float64, Float64, 1, Vector{Float64}, Vector{Float64}}}}) at tape.jl:207
GradientTape at tape.jl:204 [inlined]
_vecjacobian!(dλ::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, y::Array{Float64, 3}, λ::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, p::Vector{Float64}, t::Float64, S::DiffEqSensitivity.ODEInterpolatingAdjointSensitivityFunction{DiffEqSensitivity.AdjointDiffCache{Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, ReverseDiff.GradientTape{DiffEqSensitivity.var"#93#104"{ODEFunction{true, var"#f#12"{Array{Float64, 3}, Float64, Float64, Float64}, LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}}, Tuple{ReverseDiff.TrackedArray{Float64, Float64, 3, Array{Float64, 3}, Array{Float64, 3}}, ReverseDiff.TrackedArray{Float64, Float64, 1, Vector{Float64}, Vector{Float64}}, ReverseDiff.TrackedArray{Float64, Float64, 1, Vector{Float64}, Vector{Float64}}}, Vector{ReverseDiff.TrackedReal{Float64, Float64, ReverseDiff.TrackedArray{Float64, Float64, 1, Vector{Float64}, Vector{Float64}}}}}, Nothing, Noth...

While forward mode runs, it looks like the tracing required for adjoints causes issues with the threading constructs.

Disable subnormals inside @parallel blocks

I'm able to disable subnormals (e.g. set_zero_subnormals(true)) within @parallel_indices blocks, but not within @parallel blocks. Is there an easy way to get around this? Thanks!
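For reference, a minimal sketch of what the issue reports as working (as I understand it): calling set_zero_subnormals inside a @parallel_indices kernel body, where arbitrary code is allowed; in a @parallel kernel the body is restricted to the macro-based assignments, so there is no obvious place for such a call. The kernel and sizes below are made up:

using ParallelStencil
using ParallelStencil.FiniteDifferences2D
@init_parallel_stencil(Threads, Float64, 2)

@parallel_indices (ix, iy) function scale!(A, B)
    set_zero_subnormals(true)   # per-thread FP flag; note it is re-set for every index here
    A[ix, iy] = 0.5 * B[ix, iy]
    return
end

nx, ny = 64, 64
A = @zeros(nx, ny); B = @rand(nx, ny)
@parallel scale!(A, B)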

Question about @inn macro

I am relatively new to metaprogramming in Julia so this may simply be a personal issue.

With that acknowledged, I am struggling to understand how the @inn macro in FiniteDifferences.jl behaves as advertised. The documentation defines it as @doc "@inn(A): Select the inner elements of A. Corresponds to A[2:end-1]" :(@inn). The relevant definitions in the file include:

const ix  = INDICES[1]
const ixi = :($ix+1)
...
macro all(A::Symbol) esc(:( $A[$ix ] )) end
macro inn(A::Symbol) esc(:( $A[$ixi ] )) end

From these definitions, it seems that @inn would simply shift all of the indices of A by one (resulting in an out-of-bounds error). Considering that all of the miniapps function as intended, I know I must be missing something, but I can't for the life of me figure out how it works.
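For what it's worth, my understanding (an assumption, not a statement of ParallelStencil's internals) is that the shift is safe because @inn(A) expands to the shifted access A[ix+1], and the @parallel machinery only evaluates the assignment for indices for which all shifted accesses stay in bounds, so ix effectively sweeps 1:length(A)-2 and A[ix+1] covers exactly A[2:end-1]. A toy illustration in plain Julia:

A = collect(1.0:10.0)
inner = [A[ix+1] for ix in 1:length(A)-2]   # the shifted access over the restricted index range
@assert inner == A[2:end-1]                 # ...visits exactly the "inner elements" of A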

Debugging and profiling workflow

Is there a recommended workflow when trying to debug and profile @parallel_indices functions?

Often, my functions reference several other functions (which typically act as function barriers). If I want to profile these nested functions, or even the parent function, I run into various problems due to the way the macro rewrites the code.

For example, it would be great if I could use something like Cthulhu.jl to recursively evaluate the @code_warntype output (and check for dynamic dispatching etc.) but this currently isn't possible. More importantly, none of the current Julia debuggers can traverse into any of the kernels. In other words, I can't seem to find a way to inspect the current running state of the kernel (which is also important for checking for dynamic dispatch and specialization errors).

I could copy the body contents into a "custom" function that implements a manual loop over the indices... but this seems a little too brute force.

How to implement custom finite differencing operators

I am interested in using this package to multithread some hyperbolic PDEs, starting from the example of the 1D acoustic wave equation. The first step is to figure out how to implement my own finite difference operators (I use summation-by-parts operators that are not diagonal, for example).

The documentation states that "Custom macros to extend the finite differences submodules or for other stencil-based numerical methods can be readily plugged in"

It is not obvious to me how to do this. Say I have a non-diagonal but sparse matrix D that I want to apply to the current state vector; how can I write a macro to do this?
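Not an answer from the package authors, but one way to sidestep writing a custom @-macro is to use @parallel_indices and index the operator explicitly. Below is a sketch under the assumption of a dense operator matrix D for simplicity (all names are made up):

using ParallelStencil
@init_parallel_stencil(Threads, Float64, 1)

@parallel_indices (ix) function apply_D!(du, u, D, n)
    acc = 0.0
    for j = 1:n              # row ix of D times the state vector u
        acc += D[ix, j] * u[j]
    end
    du[ix] = acc
    return
end

n  = 64
u  = @rand(n); du = @zeros(n)
D  = @rand(n, n)             # stand-in for a (here dense) summation-by-parts operator
@parallel apply_D!(du, u, D, n)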

Add device_sync

With JuliaGPU switching to task-local state, synchronize only syncs the default stream on the current task. This may lead to conflicts between array programming operations executed on the default stream and kernel programming operations executed on a custom stream. Adding support for a (heavy) device sync may be needed in some cases:
CUDA: CUDA.device_synchronize()
AMDGPU: AMDGPU.HIP.device_synchronize()

CUDA Crash with julia 1.9.0

Hi,
I tried to run the acoustic3D.jl miniapp using CUDA (USE_GPU = true).
Everything is fine with Julia 1.8.5, but it crashes with Julia 1.9:

julia> include("acoustic3D.jl")
┌ Warning: ParallelStencil has already been initialized, with the same arguments. If you are using ParallelStencil interactively in the REPL, then you can ignore this message. If you are using ParallelStencil non-interactively, then you are likely using ParallelStencil in an inconsistent way: @init_parallel_stencil should only be called once, right after 'using ParallelStencil'.
└ @ ParallelStencil ~/.julia/packages/ParallelStencil/fQa5L/src/init_parallel_stencil.jl:73
┌ Warning: Module Data from previous module initialization found in caller module (Main); module Data not created. If you are working interactively in the REPL, then you can ignore this message.
└ @ ParallelStencil.ParallelKernel ~/.julia/packages/ParallelStencil/fQa5L/src/ParallelKernel/init_parallel_kernel.jl:33
ERROR: CUDA error: an illegal memory access was encountered (code 700, ERROR_ILLEGAL_ADDRESS)
Stacktrace:
  [1] throw_api_error(res::CUDA.cudaError_enum)
    @ CUDA ~/.julia/packages/CUDA/BbliS/lib/cudadrv/error.jl:89
  [2] macro expansion
    @ ~/.julia/packages/CUDA/BbliS/lib/cudadrv/error.jl:97 [inlined]
  [3] cuMemAllocAsync(dptr::Base.RefValue{CUDA.CuPtr{Nothing}}, bytesize::Int64, hStream::CUDA.CuStream)
    @ CUDA ~/.julia/packages/CUDA/BbliS/lib/utils/call.jl:26
  [4] #alloc#1

...

(test_pstencil) pkg> status
Status `~/temp/test_pstencil/Project.toml`
⌅ [052768ef] CUDA v3.13.1
  [94395366] ParallelStencil v0.6.1
  [91a5bcdd] Plots v1.38.11
Info Packages marked with ⌅ have new versions available but compatibility constraints restrict them from upgrading. To see why use `status --outdated`

julia> versioninfo()
Julia Version 1.9.0
Commit 8e630552924 (2023-05-07 11:25 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × 13th Gen Intel(R) Core(TM) i9-13900K
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, goldmont)
  Threads: 8 on 32 virtual cores
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 8
  JULIA_IMAGE_THREADS = 1

julia> using CUDA

julia> CUDA.versioninfo()
CUDA toolkit 11.7, artifact installation
NVIDIA driver 530.41.3, for CUDA 12.1
CUDA driver 12.1

Libraries: 
- CUBLAS: 11.10.1
- CURAND: 10.2.10
- CUFFT: 10.7.2
- CUSOLVER: 11.3.5
- CUSPARSE: 11.7.3
- CUPTI: 17.0.0
- NVML: 12.0.0+530.41.3
- CUDNN: 8.30.2 (for CUDA 11.5.0)
- CUTENSOR: 1.4.0 (for CUDA 11.5.0)

Toolchain:
- Julia: 1.9.0
- LLVM: 14.0.6
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2, 7.3, 7.4, 7.5
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

1 device:
  0: NVIDIA GeForce RTX 4070 (sm_89, 9.942 GiB / 11.994 GiB available)

julia> Unhandled Task ERROR: CUDA error: an illegal memory access was encountered (code 700, ERROR_ILLEGAL_ADDRESS)
Stacktrace:

Non-Cartesian gather!

Hi,

I have an array quantity a computed on every process and I need to gather all the a on process 0.
Can I do this gathering operation with ParallelStencil.jl, or should I switch to plain MPI programming?

Thank you for your help.

@parallel_indices exception in package precompilation

Hey there!

I'm having a great time using this package, but I have started to include ParallelStencil in a package that I am writing, and that involves tagging some functions with @parallel_indices. During precompilation of an example I have written, it errors out, as I obviously haven't initialized a stencil yet. What's curious is that functions tagged with @parallel don't cause the same error.

Is there anything I need to declare (or have missed) within my own package to stop this error?

Apologies if this is a daft question, I haven't got a lot of experience with GPU stuff!

Cheers,

Dan

Optimal Data Layout

Congrats and many thanks for your super impressive and useful package!

I have a few questions:

  • Are there docs/videos/forum threads explaining the internal design choices?

  • I was wondering about the internal data layout of fields: is it organized to adapt to the cache hierarchy and SIMD width of the compute targets? Do you think that would make sense?

  • The README states that time blocking should not be interesting for real applications. Could you elaborate? I thought that small blocks with halos could perform several time steps before communication and thus help to reduce the memory bandwidth pressure. Do you think that would make sense?

Best,

Laurent

AMDGPU v0.5.0 compat

Since AMDGPU v0.5.0, gridsize represents the number of "workgroups" (blocks in CUDA) and no longer "workitems * workgroups" (threads * blocks in CUDA), as HIP is now used for kernel launches instead of HSA.

This means that the previous AMDGPU kernel launch parameter gridsize needs to be adapted from gridsize = nx to gridsize = cld(nx, groupsize), as sketched below.
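A sketch of the required change for a 1-D launch (numbers and variable names are illustrative):

nx        = 1024                  # illustrative problem size
groupsize = 256                   # workitems per workgroup ("threads per block" in CUDA)

gridsize_old = nx                 # AMDGPU < v0.5: gridsize counted total workitems
gridsize_new = cld(nx, groupsize) # AMDGPU >= v0.5: gridsize counts workgroups ("blocks")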

Also, queues and signals are now abstracted into streams.

3d image application

Hello, I am trying to apply this library to 3D medical imaging, for which it seems ideal, yet I have problems applying the code to my domain.
To begin with, I would like to achieve a very simple thing: calculate the mean and standard deviation of the neighbouring voxels of each voxel. The pseudocode would look like this:

using Statistics    # for mean and std

# dimensionality increases: the input volume holds intensities, while the output
# volume holds the mean and standard deviation of each voxel's neighbourhood
function getMeanAndStd(data::Array{Int32,3})::Array{Float32,4}
    out = zeros(Float32, size(data)..., 2)    # preallocate the output
    r   = CartesianIndex(1, 1, 1)             # radius-1 (3x3x3) Neumann-style neighbourhood
    for I in CartesianIndices((2:size(data,1)-1, 2:size(data,2)-1, 2:size(data,3)-1))
        nsten     = data[I-r:I+r]             # neighbourhood around the Cartesian index I
        flattened = vec(nsten)
        out[I, 1] = mean(flattened)           # put the results in the proper place of the output
        out[I, 2] = std(flattened)
    end
    return out
end

Of course, in the case of big images I would need to batch the data, as I could have a problem fitting it into GPU memory; also, as seen from the output above, further analysis would be in 4D.

I just wanted to ask for some guidance on how to achieve what I wrote above, so that hopefully I will be able to go further on my own and later share the results of the work with the scientific community.

Thanks for Help!

Code generation for multiple stencils

Often, a particular stencil-based code can contain dozens or even hundreds of variations of the most "complete" stencil. For example, depending on what parameters the simulation is running, it may only need certain aspects of the most complete case.

In C/C++, many codes use some sort of code generation/ metaprogramming to handle the combinatorial explosion that occurs in these cases.

Has anyone explored @parallel kernel generation in this context? I tried creating a @generated function from a @parallel construct, but was running into some issues.

Just curious if others have thought of this. Thanks!

Efficient cache-aware and cache-oblivious implementations

There's a great example, diffusion2D_shmem_novis.jl, which shows how to leverage the shared memory of a GPU (thereby taking advantage of spatial locality common with so many stencil codes).

It would be great if we could abstract this slightly such that we could implement cache-aware algorithms on the CPU using the same kernel code as on the GPU. In essence, one divides the domain into chunks (or blocks, on a GPU) and then loops over these chunks, assigning workers as needed. Unfortunately, one doesn't have direct access to the L1 cache within a CPU (otherwise it wouldn't be called a cache). However, it's well known that by tiling the arrays so that each tile fits into cache, the data stays resident in cache, allowing you to leverage spatial locality (as is often done with CPU-based tiled matrix multiplication).

Currently, the CPU loops over individual indices... so even if someone manually chunked the domain, there would be quite a lot of cache contention. Instead, it would be better to loop over tiles in this case. (There's a bit of extra work in ensuring that the same kernel code can be used on both a CPU and a GPU because of the extra worker hierarchy within GPUs.)
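As a toy illustration of the idea (plain Julia, not a ParallelStencil feature), looping over tiles rather than individual indices could look like this, with the tile size chosen so that a tile fits in cache:

function tiled_update!(B, A; tile=64)
    nx, ny = size(A)
    for jt in 1:tile:ny, it in 1:tile:nx                             # loop over tiles
        for j in jt:min(jt+tile-1, ny), i in it:min(it+tile-1, nx)   # loop within a tile
            B[i, j] = 0.5 * A[i, j]                                  # stand-in pointwise update
        end
    end
    return B
end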

Even better would be an abstract interface to a cache-oblivious implementation that "just works" on CPUs and GPUs (i.e. a recursive technique could be used to automatically detect cache sizes). Cache-oblivious algorithms are great for a number of reasons besides portability. Namely, they tend to leverage all levels of the cache (L2, L3, etc) better than cache-aware algorithms.

I realize this is a somewhat grandiose vision... I'm just throwing it out there for people to toy with 🙂

Test against cudnn?

I'm curious what GPU performance you get against something like the cudnn wrappers of NNlibCUDA.jl. Those would be more appropriate comparisons than CuArray broadcasting.

dimensionality restrictions

I'm considering using this package for a PDE formulation where the grid is in a four-dimensional space; therefore, I'd like to ask how hard the dimensionality restriction to 1, 2 or 3D (at least according to the README) is in the code?

[New Miniapp] Time-varying pressure source for acoustic2D

Not sure if you want to avoid domain-specific miniapps, but I am working towards modifying the acoustic 2D miniapp for ultrasound and photoacoustic modeling. Currently I have just played around with getting a time-varying source working.

My current implementation just sets the pressure at a given pixel based on some source signal vector.

freq   = 1/dt/5                  # divide by 5 to stay below the Nyquist limit of the grid's sampling frequency
trange = range(0, 2pi, length=nt)
amp    = 1
signal = amp*sin.(freq*trange)

# Time loop
for it = 1:nt
    P[nx÷2, ny÷2] = signal[it]   # Set source signal value
    @parallel compute_V!(Vx, Vy, P, dt, ρ, dx, dy)
    @parallel compute_P!(P, Vx, Vy, dt, k, dx, dy)

    # Visualisation
end

I can clean my code up a bit and submit it as an acoustic2D-with-time-varying-source miniapp, but I wanted to ask whether this would be a recommended way of implementing a time-varying source, or whether it poses obvious performance penalties that users shouldn't be exposed to.
