eth-cscs / cosma

Distributed Communication-Optimal Matrix-Matrix Multiplication Algorithm

License: BSD 3-Clause "New" or "Revised" License

cosma's Introduction

COSMA is a parallel, high-performance, GPU-accelerated matrix-matrix multiplication algorithm that is communication-optimal for all combinations of matrix dimensions, number of processors and memory sizes, without the need for any parameter tuning. The key idea behind COSMA is to first derive a tight optimal sequential schedule and only then parallelize it, preserving I/O optimality between processes. This stands in contrast with the 2D and 3D algorithms, which fix the process domain decomposition upfront and then map it to the matrix dimensions, which may result in asymptotically more communication. The final design of COSMA facilitates the overlap of computation and communication, ensuring speedups and the applicability of modern mechanisms such as RDMA. COSMA may also leave some processors unused in order to optimize the processor grid, which reduces the communication volume even further and increases the computation volume per processor.

COSMA received the Best Student Paper Award at the prestigious Supercomputing 2019 conference in Denver, US.

COSMA alleviates the issues of current state-of-the-art algorithms, which can be summarized as follows:

  • 2D (SUMMA): Requires manual tuning and is not communication-optimal in the presence of extra memory.
  • 2.5D: Optimal for m=n, but inefficient for m << n or n << m and for some numbers of processes p.
  • Recursive (CARMA): Asymptotically communication-optimal for all m, n, k, p, but always splitting the largest dimension might lead to an increase in communication volume of up to a factor of √3.
  • COSMA (this work): Strictly communication-optimal (not just asymptotically) for all m, n, k, p and memory sizes, yielding speedups of up to 8.3x over the second-fastest algorithm.

In addition to being communication-optimal, this implementation is highly optimized to reduce the memory footprint in the following sense:

  • Buffer Reuse: all the buffers are pre-allocated and carefully reused during execution, including the buffers necessary for the communication, which reduces the total memory usage.
  • Reduced Local Data Movement: the assignment of data blocks to processes is fully adapted to the communication pattern, which minimizes the need for the local data reshuffling that arises after each communication step.

The library supports both one-sided and two-sided MPI communication backends. It uses dgemm for the local computations, but also supports GPU acceleration through our Tiled-MM library using cublas or rocBLAS.

COSMA Literature

The paper and other materials on COSMA are available under the following link:


  • [NEW] Multi-GPU Systems Support: COSMA is now able to take advantage of fast GPU-to-GPU interconnects, either through the NCCL/RCCL libraries or by using GPU-aware MPI. Both NVIDIA and AMD GPUs are supported.
  • ScaLAPACK API Support: it is enough to link to COSMA, without changing the code, and all p?gemm calls will use the ScaLAPACK wrappers provided by COSMA.
  • C/Fortran Interface: written in C++, but provides C and Fortran interfaces.
  • Custom Types: fully templatized types.
  • GPU acceleration: supports both NVIDIA and AMD GPUs.
  • Supported BLAS (CPU) backends: MKL, LibSci, NETLIB, BLIS, ATLAS.
  • Custom Data Layout Support: natively uses its own blocked data layout of matrices, but supports arbitrary grid-like data layouts of matrices.
  • Transposition/Conjugation Support: matrices A and B can be transposed and/or conjugated.
  • Communication and Computation Overlap: supports overlapping of communication and computation.
  • Spack Installation: can be built and installed with Spack since v14.1.
  • Julia Package: see the Julia language section on how to use COSMA in the Julia language.

Building COSMA

See Installation Instructions.

COSMA Dependencies

COSMA is a CMake project and requires a recent CMake (>=3.17).

External dependencies:

  • MPI 3: (required)
  • BLAS: when the problem becomes local, COSMA uses the provided ?gemm backend, which can be one of the following:
    • MKL (default)
    • BLIS
    • ATLAS
    • CRAY_LIBSCI: Cray-libsci or Cray-libsci_acc (GPU-accelerated)
    • CUDA: cublas is used for NVIDIA GPUs
    • ROCM: rocBLAS is used for AMD GPUs
    • CUSTOM: user-provided BLAS API

Some dependencies are bundled as submodules and need not be installed explicitly:

  • Tiled-MM - cublasXt GEMM replacement that is also ported to AMD GPUs.
  • COSTA - distributed matrix reshuffle and transpose algorithm.
  • semiprof - profiling utility.
  • gtest_mpi - MPI utility wrapper over GoogleTest (unit testing library).


To allow easy integration, COSMA can be used in the following ways:

  • without changing your code: if your code already uses the ScaLAPACK API, you can simply link to COSMA before linking to any other library that provides pxgemm, and all pxgemm calls will then use COSMA without any change to your code. To get a feeling for the performance you can expect, have a look at the pdgemm miniapp. To see how to link your code to COSMA pxgemm, have a look at the 30 seconds tutorial. This is how we integrated COSMA into the CP2K quantum chemistry simulator, which you can read more about in the production example.

  • adapting your code: if your code is not using ScaLAPACK, then there are two interfaces that can be used:

    • custom layout: if your matrices are distributed in a custom way, then it is enough to pass the descriptors of your data layout to the multiply_using_layout function, which will then adapt COSMA to your own layout.
    • native COSMA layout: to get the maximum performance, the native COSMA matrix layout should be used. To get an idea of the performance you can expect to get, please have a look at the matrix multiplication miniapp.

The documentation for the latter option will soon be published here.

Using COSMA in 30 seconds

For easy integration, it is enough to build COSMA with ScaLAPACK API and then link your code to COSMA before linking to any other library providing ScaLAPACK pxgemm. This way, all pxgemm calls will be using COSMA pxgemm wrappers. To achieve this, please follow these steps:

  1. Build COSMA with ScaLAPACK API:
# get COSMA
git clone --recursive cosma && cd cosma

# build and install COSMA
mkdir build && cd build

# set up the compiler, e.g. with:
export CC=`which cc`
export CXX=`which CC`

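# choose BLAS and SCALAPACK versions you want to use
# (MKL is shown here only as an example value for COSMA_BLAS and COSMA_SCALAPACK)
cmake -DCOSMA_BLAS=MKL -DCOSMA_SCALAPACK=MKL ..
make -j 8
make install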

!! Note the --recursive flag !!

  2. Link your code to COSMA:

    • CPU-only version of COSMA:

      • link your code to:

      -L/cosma/lib64 -lcosma_pxgemm -lcosma -lcosta_scalapack

      • then link to the BLAS and ScaLAPACK you built COSMA with (see COSMA_BLAS and COSMA_SCALAPACK flags in cmake):

      -L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -lgomp -lpthread -lm

    • using GPU-accelerated version of COSMA:

      • link your code to:

      -L/cosma/lib64 -lcosma_pxgemm -lcosma -lcosta_scalapack -lTiled-MM

      • link to the GPU backend you built COSMA with (see COSMA_BLAS flag in cmake):

      -lcublas -lcudart -lrt

      • then link to the ScaLAPACK you built COSMA with (see COSMA_SCALAPACK flag in cmake):

      -L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -lgomp -lpthread -lm

  3. Include headers:


COSMA on Multi-GPU Systems

COSMA is able to take advantage of fast GPU-to-GPU interconnects on multi-GPU systems. This can be achieved in one of the following ways.

Using NCCL/RCCL Libraries

When running cmake for COSMA, make sure to specify -DCOSMA_WITH_NCCL=ON, e.g. by doing:

    # NVIDIA GPUs
    # this will look for the NCCL library using the following environment variables:
    # - NCCL_ROOT: Base directory where all NCCL components are found
    # - NCCL_INCLUDE_DIR: Directory where NCCL header is found
    # - NCCL_LIB_DIR: Directory where NCCL library is found

    # AMD GPUs
    # this will look for the RCCL library using the following environment variables:
    # - RCCL_ROOT_DIR: Base directory where all RCCL components are found
    # - RCCL_INCLUDE_DIR: Directory where RCCL header is found
    # - RCCL_LIB_DIR: Directory where RCCL library is found

Using GPU-aware MPI

Make sure that GPU-aware MPI is enabled in your environment and specify -DCOSMA_WITH_GPU_AWARE_MPI=ON when running cmake for COSMA, e.g. by doing:

    # Before running cmake, make sure that GPU-aware MPI is enabled on your system.
    # For example, on Cray systems, this can be done by setting the following environment variables:
    # - export MPICH_RDMA_ENABLED_CUDA=1

COSMA in Production


COSMA is integrated into the CP2K quantum chemistry simulator. Since COSMA provides the ScaLAPACK API, it is enough to link CP2K to COSMA, without changing the CP2K code at all, which makes the integration trivial even if (as in the case of CP2K) the simulation code is written in Fortran.

In the production run, we ran the Random-Phase Approximation (RPA) benchmark of 128 water molecules, using the Resolution of Identity (RI). The benchmark was run once on 1024 and once on 128 nodes of the GPU partition of the Piz Daint supercomputer (Cray XC50). Computationally, the dominant part of this benchmark consists of 46 tall-and-skinny dense matrix multiplications, with the parameters shown in the table below:

On 1024 nodes, we compared the performance of CP2K using COSMA and Cray-libsci_acc (version: 19.10.1), both being GPU accelerated, for all dense matrix-matrix multiplications (pdgemm routine). As can be seen in the following table, the version with COSMA was approximately 2x faster.

On 128 nodes, we compared the performance of CP2K using the following libraries for multiplying matrices (pdgemm routine): MKL, Cray-libsci (version: 19.06.1), Cray-libsci_acc (version: 19.10.1, GPU accelerated) and COSMA (both CPU-only and GPU-accelerated versions). The version with COSMA was the fastest on both CPU and GPU. The CPU version of COSMA achieved the peak performance, whereas the GPU version achieved more than 65% of the peak performance of the GPUs. Keep in mind that the peak performance of GPUs assumes that the data already resides on the GPUs, which is not the case here, since the matrices initially resided on the CPU. This is one of the reasons why the peak performance is not achieved with the GPU version. Still, the GPU version of COSMA was 25-27% faster than the second-best library in this case. The results are summarized in the following table:

With COSMA, even higher speedups are possible, depending on matrix shapes. To illustrate possible performance gains, we also ran different square matrix multiplications on the same number of nodes (=128) of the Piz Daint supercomputer. The block size is 128x128 and the processor grid is also square: 16x16 (2 ranks per node). The performance of COSMA is compared against Intel MKL ScaLAPACK. The results on Cray XC50 (GPU-accelerated) and Cray XC40 (CPU-only) are summarized in the following table:

All the results from this section assumed matrices given in (block-cyclic) ScaLAPACK data layout. However, if the native COSMA layout is used, even higher throughput is possible.

Julia language

The COSMA.jl Julia package uses COSMA's C-interface to provide COSMA-based matrix-matrix multiplication for the DistributedArrays.jl package. A minimal working example to multiply two random matrices looks as follows:

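using MPIClusterManagers, DistributedArrays, Distributed

# start 6 MPI worker processes and register them with Distributed
manager = MPIManager(np = 6)
addprocs(manager)

@everywhere using COSMA

# just on the host
COSMA.use_manager(manager)

A = drand(8000, 8000) * drand(8000, 8000)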


# for CPU-only version
# for Hybrid (CPU+GPU) version

The script will use SLURM to submit a job on 10 nodes. The job will run 2 matrix multiplications and output the time COSMA algorithm took.

Matrix Multiplication

The project contains a miniapp that produces two random matrices A and B, computes their product C with the COSMA algorithm and outputs the time of the multiplication.

The miniapp consists of an executable ./build/miniapp/cosma_miniapp which can be run with the following command line (assuming we are in the root folder of the project):

# set the number of threads to be used by each MPI rank
# if using CPU version with MKL backend, set MKL_NUM_THREADS as well
# run the miniapp
mpirun -np 4 ./build/miniapp/cosma_miniapp -m 1000 -n 1000 -k 1000 -r 2

The overview of all supported options is given below:

  • -m (--m_dim) (default: 1000): number of rows of matrices A and C.
  • -n (--n_dim) (default: 1000): number of columns of matrices B and C.
  • -k (--k_dim) (default: 1000): number of columns of matrix A and rows of matrix B.
  • -s (--steps) (optional): comma-separated string of triplets defining the splitting strategy. Each triplet defines one step of the algorithm. The first character of the triplet defines whether it is a parallel (p) or a sequential (s) step. The second character defines the dimension that is split in this step. The third parameter is an integer which defines the divisor. This option can be omitted, in which case the default strategy is used. A possible value for the example above: --steps=sm2,pn2,pk2.
  • -r (--n_rep) (optional, default: 2): the number of repetitions.
  • -t (--type) (optional, default: double): data type of matrix entries. Can be one of: float, double, zfloat and zdouble. The last two correspond to complex numbers.
  • --test (optional): if present, the result of COSMA will be verified against the result of the available ScaLAPACK.
  • -h (--help) (optional): print available options.
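
To make the --steps format described above concrete, here is a minimal Python sketch that parses such a string into its triplets. It is not part of COSMA: the names Step, parse_steps and ranks_used are hypothetical, and the idea that the product of the parallel divisors corresponds to the number of ranks is an assumption that happens to match the 4-rank example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    kind: str     # "p" for a parallel step, "s" for a sequential step
    dim: str      # dimension split in this step: "m", "n" or "k"
    divisor: int  # number of parts the dimension is divided into


def parse_steps(spec: str) -> list:
    """Parse a strategy string such as 'sm2,pn2,pk2' into a list of Step."""
    steps = []
    for triplet in spec.split(","):
        triplet = triplet.strip()
        if len(triplet) < 3:
            raise ValueError(f"malformed triplet: {triplet!r}")
        kind, dim, divisor_text = triplet[0], triplet[1], triplet[2:]
        if kind not in ("p", "s"):
            raise ValueError(f"unknown step type in {triplet!r}")
        if dim not in ("m", "n", "k"):
            raise ValueError(f"unknown dimension in {triplet!r}")
        try:
            divisor = int(divisor_text)
        except ValueError:
            raise ValueError(f"divisor is not an integer in {triplet!r}") from None
        if divisor < 1:
            raise ValueError(f"divisor must be positive in {triplet!r}")
        steps.append(Step(kind, dim, divisor))
    return steps


def ranks_used(steps) -> int:
    """Product of the divisors of all parallel steps (assumption, see lead-in)."""
    total = 1
    for step in steps:
        if step.kind == "p":
            total *= step.divisor
    return total


if __name__ == "__main__":
    strategy = parse_steps("sm2,pn2,pk2")
    for step in strategy:
        print(step)
    print("parallel divisor product:", ranks_used(strategy))
```

For sm2,pn2,pk2 the sketch yields one sequential split of m followed by parallel splits of n and k, and the parallel divisors multiply to 4.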

COSMA pxgemm wrapper

COSMA also contains a wrapper for ScaLAPACK pxgemm calls which offers the ScaLAPACK interface (pxgemm functions with exactly the same signatures as in ScaLAPACK). Calling these functions takes care of transforming the matrices from the ScaLAPACK to the COSMA data layout, performing the multiplication using the COSMA algorithm and transforming the result back to the specified ScaLAPACK data layout.

The miniapp consists of an executable ./build/miniapp/pxgemm_miniapp which can be run as follows (assuming we are in the root folder of the project):

# set the number of threads to be used by each MPI rank
# if using CPU version with MKL backend, set MKL_NUM_THREADS as well
# run the miniapp
mpirun -np 4 ./build/miniapp/pxgemm_miniapp -m 1000 -n 1000 -k 1000 \
                                            --block_a=128,128 \
                                            --block_b=128,128 \
                                            --block_c=128,128 \
                                            --p_grid=2,2 \
                                            --transpose=NN \
                                            --type=double \
                                            --algorithm=cosma

The overview of all supported options is given below:

  • -m (--m_dim) (default: 1000): number of rows of matrices A and C.
  • -n (--n_dim) (default: 1000): number of columns of matrices B and C.
  • -k (--k_dim) (default: 1000): number of columns of matrix A and rows of matrix B.
  • --block_a (optional, default: 128,128): 2D-block size for matrix A.
  • --block_b (optional, default: 128,128): 2D-block size for matrix B.
  • --block_c (optional, default: 128,128): 2D-block size for matrix C.
  • -p (--p_grid) (optional, default: 1,P): 2D-processor grid. By default 1xP where P is the total number of MPI ranks.
  • --transpose (optional, default: NN): transpose/conjugate flags to A and B.
  • --alpha (optional, default: 1): alpha parameter in C = alpha*A*B + beta*C.
  • --beta (optional, default: 0): beta parameter in C = alpha*A*B + beta*C.
  • -r (--n_rep) (optional, default: 2): number of repetitions.
  • -t (--type) (optional, default: double): data type of matrix entries. Can be one of: float, double, zfloat and zdouble. The last two correspond to complex numbers.
  • --test (optional): if present, the result of COSMA will be verified against the result of the available ScaLAPACK.
  • --algorithm (optional, default: both): defines which algorithm (cosma, scalapack or both) to run.
  • -h (--help) (optional): print available options.
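
The --block_* and --p_grid options describe the usual ScaLAPACK 2D block-cyclic layout. As a rough illustration of what these numbers mean, the following Python sketch computes which process of the grid owns a given matrix element and how many rows (or columns) each process holds. The function names owner and local_extent are hypothetical, and the sketch assumes 0-based indices and that the distribution starts at process (0, 0).

```python
def owner(row, col, block_rows, block_cols, grid_rows, grid_cols):
    """Grid coordinates of the process owning global element (row, col).

    Assumes a 2D block-cyclic layout whose first block lives on process (0, 0).
    """
    return (row // block_rows) % grid_rows, (col // block_cols) % grid_cols


def local_extent(n, block, nprocs, proc):
    """Number of rows (or columns) of an n-long dimension stored on `proc`.

    Blocks of size `block` are dealt out cyclically to `nprocs` processes,
    starting with process 0; the last block may be shorter.
    """
    full_blocks, remainder = divmod(n, block)
    count = (full_blocks // nprocs) * block
    extra_blocks = full_blocks % nprocs
    if proc < extra_blocks:
        count += block       # this process gets one more full block
    elif proc == extra_blocks:
        count += remainder   # this process gets the trailing partial block
    return count


if __name__ == "__main__":
    # values from the example above: 1000x1000 matrix, 128x128 blocks, 2x2 grid
    print(owner(130, 300, 128, 128, 2, 2))
    print([local_extent(1000, 128, 2, p) for p in range(2)])
```

With the values of the example command, the 1000 rows are split into 512 rows on the first process row and 488 on the second.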

Tunable Parameters

Parameters Overview

The tunable parameters that can be set through environment variables are listed below, together with their accepted values and, where stated, their defaults.

  • COSMA_OVERLAP_COMM_AND_COMP (ON, OFF): if enabled, communication and computation might be overlapped, depending on the built-in heuristics.
  • COSMA_ADAPT_STRATEGY (ON, OFF): if enabled, COSMA will try to natively use the ScaLAPACK layout, without transforming to the COSMA layout. Used only in the pxgemm wrapper.
  • COSMA_CPU_MAX_MEMORY (integer (size_t), default: infinite): CPU memory limit in megabytes per MPI process (rank). Allowing too little memory might reduce the performance.
  • COSMA_GPU_MEMORY_PINNING (ON, OFF): if enabled, COSMA will pin parts of the host memory to speed up CPU-GPU memory transfers. Used only in the GPU backend.
  • COSMA_GPU_MAX_TILE_M, COSMA_GPU_MAX_TILE_N, COSMA_GPU_MAX_TILE_K (integer (size_t), default: 5000): tile sizes for each dimension that are used to pipeline the local CPU matrices to the GPU. K refers to the shared dimension and MxN refers to the dimensions of matrix C.
  • COSMA_GPU_STREAMS (integer (size_t), default: 2): the number of GPU streams that each rank should use.
  • COSMA_MEMORY_POOL_AMORTIZATION (real (double), default: 1.2): the growth factor for the memory pool. If equal to 1.2, then 1.2x the requested size is allocated (thus, 20% more than needed). Higher values better amortize the cost of resizing the memory pool, which can occur when the algorithm is invoked for different matrix sizes. However, higher values also mean that potentially more memory is allocated than used, which can be a problem when memory is tight.
  • COSMA_MIN_LOCAL_DIMENSION (integer (size_t), default: 200): if any matrix dimension becomes smaller than this threshold (after splitting the matrices among the available MPI ranks), then the actual number of ranks is reduced so that all matrix dimensions stay at or above this limit.
  • COSMA_DIM_THRESHOLD (integer (size_t), default: 0): in the ScaLAPACK wrappers, if any matrix dimension is less than this threshold, the problem is considered too small and is dispatched to ScaLAPACK for computation. This only affects the ScaLAPACK wrappers.
  • COSMA_CPU_MEMORY_ALIGNMENT (integer (size_t), default: 0): the number of bytes to which all CPU (host) buffers will be aligned.

These are all optional parameters. They are read at runtime, hence changing any of them does not require the code to be recompiled.

Below we discuss in detail how to set the limits for both the CPU and the GPU memory that COSMA is allowed to use.
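
To illustrate the trade-off behind COSMA_MEMORY_POOL_AMORTIZATION, the following Python sketch models a pool that, whenever a request exceeds its capacity, reallocates to the growth factor times the requested size. This is only a toy model of a growth-factor policy under that assumption, not COSMA's actual memory pool; the function name simulate_pool and the request sizes are made up for the example.

```python
import math


def simulate_pool(requests, factor=1.2):
    """Return (final_capacity, number_of_resizes) for a sequence of requests.

    Toy model: the pool grows to ceil(factor * request) whenever a request
    does not fit into the current capacity, and never shrinks.
    """
    capacity = 0
    resizes = 0
    for need in requests:
        if need > capacity:
            capacity = math.ceil(factor * need)
            resizes += 1
    return capacity, resizes


if __name__ == "__main__":
    # slowly growing buffer requests, e.g. from multiplications of growing size
    requests = [1000, 1100, 1200, 1300]
    for factor in (1.0, 1.2, 2.0):
        capacity, resizes = simulate_pool(requests, factor)
        print(f"factor={factor}: capacity={capacity}, resizes={resizes}")
```

In this model a factor of 1.0 resizes on every growing request, while larger factors resize less often at the price of a larger final allocation.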

Controlling GPU memory

Controlling how much GPU memory COSMA is allowed to use can be done by specifying the tile dimensions as:

export COSMA_GPU_MAX_TILE_M=5000
export COSMA_GPU_MAX_TILE_N=5000
export COSMA_GPU_MAX_TILE_K=5000

where K refers to the shared dimension and MxN refer to the dimensions of matrix C. By default, all tiles are square and have dimensions 5000x5000.

These are only the maximum tiles and the actual tile sizes that will be used might be less, depending on the problem size. These variables are only used in the GPU backend for pipelining the local matrices to GPUs.

It is also possible to specify the number of GPU streams:

export COSMA_GPU_STREAMS=2

The values given here are the default values.

The algorithm will then require device memory for at most this many elements:

num_streams * (tile_m * tile_k + tile_k * tile_n + tile_m * tile_n)

Therefore, by changing the values of these variables, it is possible to control the usage of GPU memory.
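
As a worked example of the bound above, the following Python sketch evaluates it for the default tile sizes and stream count and converts the element count to bytes. The function names and the bytes-per-element table are assumptions made for this illustration (4 and 8 bytes for float and double, twice that for the complex types).

```python
# bytes per matrix entry for the data types accepted by the miniapps
BYTES_PER_ELEMENT = {"float": 4, "double": 8, "zfloat": 8, "zdouble": 16}


def gpu_elements_bound(tile_m=5000, tile_n=5000, tile_k=5000, num_streams=2):
    """Upper bound on the number of matrix elements held in device memory."""
    return num_streams * (tile_m * tile_k + tile_k * tile_n + tile_m * tile_n)


def gpu_bytes_bound(dtype="double", **kwargs):
    """The same bound expressed in bytes for the given element type."""
    return gpu_elements_bound(**kwargs) * BYTES_PER_ELEMENT[dtype]


if __name__ == "__main__":
    print(gpu_elements_bound())             # 150000000 elements
    print(gpu_bytes_bound("double") / 1e9)  # 1.2 (GB)
    # halving every tile dimension reduces the bound by a factor of four
    print(gpu_elements_bound(2500, 2500, 2500))  # 37500000 elements
```

With the defaults, the bound is 150 million elements, i.e. about 1.2 GB of device memory for double precision.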

Controlling CPU memory

If the available CPU memory is a scarce resource, it is possible to set a CPU memory limit for COSMA by exporting the following environment variable:

export COSMA_CPU_MAX_MEMORY=1024 # in megabytes per MPI process (rank)

which will set the upper limit [in MB] on the memory that each MPI process (rank) is allowed to use. This might, however, reduce the performance.

In case the algorithm is not able to perform the multiplication within the given memory range, a runtime_error will be thrown.

This parameter is still in the testing phase!


Use -DCOSMA_WITH_PROFILING=ON to instrument the code. We use the semiprof profiler, written by Benjamin Cumming.

Running the miniapp locally (from the project root folder) with the following command:

mpirun --oversubscribe -np 4 ./build/miniapp/cosma_miniapp -m 1000 -n 1000 -k 1000 -P 4

Produces the following output from rank 0:

Matrix dimensions (m, n, k) = (1000, 1000, 1000)
Number of processors: 4

_p_ REGION                     CALLS      THREAD        WALL       %
_p_ total                          -       0.110       0.110   100.0
_p_   multiply                     -       0.098       0.098    88.7
_p_     computation                2       0.052       0.052    47.1
_p_     communication              -       0.046       0.046    41.6
_p_       copy                     3       0.037       0.037    33.2
_p_       reduce                   3       0.009       0.009     8.3
_p_     layout                    18       0.000       0.000     0.0
_p_   preprocessing                3       0.012       0.012    11.3

The percentage is always relative to the first level above. All time measurements are in seconds.


  • Grzegorz Kwasniewski, Marko Kabic, Maciej Besta, Joost VandeVondele, Raffaele Solca, Torsten Hoefler

Cite as:

  title={Red-blue pebbling revisited: Near optimal parallel matrix-matrix multiplication},
  author={Kwasniewski, Grzegorz and Kabi{\'c}, Marko and Besta, Maciej and VandeVondele, Joost and Solc{\`a}, Raffaele and Hoefler, Torsten},
  booktitle={Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},


For questions, feel free to contact us, and we will soon get back to you:

If you need any help with the integration of COSMA into your library, we will be more than happy to help you!


This work was funded in part by:

ETH Zurich: Swiss Federal Institute of Technology in Zurich
CSCS: Swiss National Supercomputing Centre
PASC: Platform for Advanced Scientific Computing
ERC: European Research Council (Horizon2020, grant agreement DAPP, No.678880)
MaX: Materials design at the Exascale (Horizon2020, grant agreement MaX CoE, No. 824143.)

We thank Thibault Notargiacomo, Sam Yates, Benjamin Cumming and Simon Pintarelli for their generous contributions to the project: great ideas, useful advice and fruitful discussions.

cosma's People


adhocman, bors[bot], dmikushin, haampie, kabicm, mtaillefumier, pkestene, rasolca, rmeli, simonpintarelli, teonnik, wardlt



cosma's Issues

CMake compilation is completely broken

I tried to compile the source code using the given CMakeLists.txt in an Intel+Linux environment, relying on Intel oneAPI for the C/C++, Fortran and MPI compilers and for the MKL library.
Using the provided CMakeLists.txt, I got this error:

-- Could NOT find MPI_CXX (missing: MPI_CXX_WORKS)
-- Could NOT find MPI (missing: MPI_CXX_FOUND)
    Reason given by package: MPI component 'C' was requested, but language C is not enabled.  MPI component 'Fortran' was requested, but language Fortran is not enabled.

Hence, I modified the file to include the required languages. If I do so, it successfully loads MPI_C and MPI_Fortran, but it is totally unable to load MPI_CXX, as shown by this error

-- Found MPI_C: /home/sw/openMPI/4.0.3/lib/ (found version "3.1")
-- Could NOT find MPI_CXX (missing: MPI_CXX_WORKS)
-- Found MPI_Fortran: /home/sw/openMPI/4.0.3/lib/ (found version "3.1")
CMake Error at /home/sw/cmake/3.19.1/share/cmake-3.19/Modules/FindPackageHandleStandardArgs.cmake:218 (message):
  Could NOT find MPI (missing: MPI_CXX_FOUND) (found version "3.1")

Even trying to avoid the configuration of MPI_CXX, the CMake script completely fails to load the MKL library.

Is there any update to the CMake configuration that allows this library to be compiled easily?

Improve the performance of the GPU backend for small matrices

As observed by @pseewald, when COSMA is compiled with the GPU backend, there is a large overhead for small matrices.

The overhead comes from the price of CPU<->GPU transfers which doesn't pay off for small matrices. In this case, it is better to perform the local multiplications directly on CPU.

trouble compiling with OpenBLAS

Hi! I'm trying to test out COSMA with OpenBLAS and running into problems. Importantly, I'm not using the CMake install of OpenBLAS, as using it gives me issues in a separate project; I'm using the standard make based install of OpenBLAS. I used the following steps to compile COSMA:

mkdir build
make -j cosma_miniapp

I get the following errors:

../src/cosma/libcosma.a(blas.cpp.o): In function `cosma::gemm(int, int, int, double, double const*, int, double const*, int, double, double*, int)':
blas.cpp:(.text+0x4c): undefined reference to `cblas_dgemm(CBLAS_ORDER, CBLAS_TRANSPOSE, CBLAS_TRANSPOSE, int, int, int, double, double const*, int, double const*, int, double, double*, int)'
../src/cosma/libcosma.a(blas.cpp.o): In function `cosma::gemm(int, int, int, std::complex<double>, std::complex<double> const*, int, std::complex<double> const*, int, std::complex<double>, std::complex<double>*, int)':
blas.cpp:(.text+0xdc): undefined reference to `cblas_zgemm(CBLAS_ORDER, CBLAS_TRANSPOSE, CBLAS_TRANSPOSE, int, int, int, void const*, void const*, int, void const*, int, void const*, void*, int)'
../src/cosma/libcosma.a(blas.cpp.o): In function `cosma::gemm(int, int, int, float, float const*, int, float const*, int, float, float*, int)':
blas.cpp:(.text+0x14c): undefined reference to `cblas_sgemm(CBLAS_ORDER, CBLAS_TRANSPOSE, CBLAS_TRANSPOSE, int, int, int, float, float const*, int, float const*, int, float, float*, int)'
../src/cosma/libcosma.a(blas.cpp.o): In function `cosma::gemm(int, int, int, std::complex<float>, std::complex<float> const*, int, std::complex<float> const*, int, std::complex<float>, std::complex<float>*, int)':
blas.cpp:(.text+0x1d8): undefined reference to `cblas_cgemm(CBLAS_ORDER, CBLAS_TRANSPOSE, CBLAS_TRANSPOSE, int, int, int, void const*, void const*, int, void const*, int, void const*, void*, int)'
collect2: error: ld returned 1 exit status

I've been playing around with the final link command here but to no avail. Any suggestions?

CUDAToolkit not specified in CMakeLists.txt

You do not specify the path to the CUDA libraries. Therefore, one can't install COSMA with CUDA support.
When running



-- Selected BLAS backend for COSMA: CUDA
-- Selected SCALAPACK backend for COSMA: OFF
-- cxxopts version 2.2.0
CUDA not found. Please specify CUDA_PATH variable.
CMake Error at /usr/local/lib/python2.7/dist-packages/cmake/data/share/cmake-3.12/Modules/FindPackageHandleStandardArgs.cmake:137 (message):
Call Stack (most recent call first):
  /usr/local/lib/python2.7/dist-packages/cmake/data/share/cmake-3.12/Modules/FindPackageHandleStandardArgs.cmake:378 (_FPHSA_FAILURE_MESSAGE)
  libs/Tiled-MM/cmake/FindCUBLAS.cmake:51 (find_package_handle_standard_args)
  libs/Tiled-MM/CMakeLists.txt:30 (find_package)

Shared library

Please include a CMake option to build COSMA as a shared library.

Amortize pinning/unpinning of memory.

Instead of pinning/unpinning of buffers within each local_multiply call, amortize it by reusing pinned buffers from the previous multiplication if possible. This can be done through the context.

Multiple test cases in test.pdgemm segfault with OpenMPI

The test cases are: 10, 11, 13, 21, 22 and 24

Setup (versions as they appear in the library paths of the trace below): GCC 10.2.0, OpenMPI 4.0.5 and Intel MKL 2020.4.304

Error message for test case 10:

2: [T480s:485190] *** Process received signal ***
2: [T480s:485188] *** Process received signal ***
2: [T480s:485188] Signal: Segmentation fault (11)
2: [T480s:485188] Signal code: Address not mapped (1)
2: [T480s:485188] Failing at address: 0xffffffff82a03518
2: [T480s:485188] [T480s:485190] Signal: Segmentation fault (11)
2: [T480s:485190] Signal code: Address not mapped (1)
2: [T480s:485190] Failing at address: 0xffffffffaa9cc258
2: [T480s:485190] [ 0] /usr/lib/[0x7ffba86950f0]
2: [T480s:485190] [ 1] /home/teonnik/code/spack/opt/spack/linux-arch-skylake/gcc-10.2.0/openmpi-4.0.5-hb4m5wzvl2chgf47y6vsuxwgbqtnzflj/lib/[0x7ffbaf83cd17]
2: [T480s:485190] [ 2] /home/teonnik/code/spack/opt/spack/linux-arch-skylake/gcc-10.2.0/intel-mkl-2020.4.304-vengwtbz3klxn4tmjegweaggvcgo6qut/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/[0x7ffba846643a]
2: [T480s:485190] [ 3] [ 0] /usr/lib/[0x7f09e01740f0]
2: [T480s:485188] [ 1] /home/teonnik/code/spack/opt/spack/linux-arch-skylake/gcc-10.2.0/openmpi-4.0.5-hb4m5wzvl2chgf47y6vsuxwgbqtnzflj/lib/[0x7f09e731bd17]
2: /home/teonnik/code/spack/opt/spack/linux-arch-skylake/gcc-10.2.0/intel-mkl-2020.4.304-vengwtbz3klxn4tmjegweaggvcgo6qut/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/[0x7ffbaff869bc]
2: [T480s:485190] [ 4] [T480s:485188] [ 2] /home/teonnik/code/spack/opt/spack/linux-arch-skylake/gcc-10.2.0/intel-mkl-2020.4.304-vengwtbz3klxn4tmjegweaggvcgo6qut/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/[0x7f09dff4543a]
2: [T480s:485188] [ 3] /home/teonnik/code/spack/opt/spack/linux-arch-skylake/gcc-10.2.0/intel-mkl-2020.4.304-vengwtbz3klxn4tmjegweaggvcgo6qut/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/[0x7ffbaffebd27]
2: /home/teonnik/code/spack/opt/spack/linux-arch-skylake/gcc-10.2.0/intel-mkl-2020.4.304-vengwtbz3klxn4tmjegweaggvcgo6qut/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/[0x7f09e7a659bc]
2: [T480s:485188] [ 4] [T480s:485190] [ 5] /home/teonnik/code/spack/opt/spack/linux-arch-skylake/gcc-10.2.0/intel-mkl-2020.4.304-vengwtbz3klxn4tmjegweaggvcgo6qut/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/[0x7f09e7acad27]
2: [T480s:485188] [ 5] ./test.pdgemm(+0x2153d)[0x555d81b1c53d]
2: [T480s:485188] [ 6] ./test.pdgemm(+0x21f5c)[0x555d81b1cf5c]
2: [T480s:485188] [ 7] ./test.pdgemm(+0x15153)[0x555d81b10153]
2: [T480s:485188] [ 8] ./test.pdgemm(+0x532d7)[0x555d81b4e2d7]
2: [T480s:485188] [ 9] ./test.pdgemm(+0x494b9)[0x555d81b444b9]
2: [T480s:485188] [10] ./test.pdgemm(+0x49785)[0x555d81b44785]
2: [T480s:485188] [11] ./test.pdgemm(+0x49915)[0x555d81b44915]
2: [T480s:485188] [12] ./test.pdgemm(+0x4a033)[0x555d81b45033]
2: [T480s:485188] [13] ./test.pdgemm(+0x53857)[0x555d81b4e857]
2: [T480s:485188] [14] ./test.pdgemm(+0x4a174)[0x555d81b45174]
2: [T480s:485188] [15] ./test.pdgemm(+0xf1a3)[0x555d81b0a1a3]
2: [T480s:485188] [16] /usr/lib/[0x7f09dfa3d152]
2: [T480s:485188] [17] ./test.pdgemm(+0x1010e)[0x555d81b0b10e]
2: [T480s:485188] *** End of error message ***
2: ./test.pdgemm(+0x2153d)[0x55aaaa49053d]
2: [T480s:485190] [ 6] ./test.pdgemm(+0x21f5c)[0x55aaaa490f5c]
2: [T480s:485190] [ 7] ./test.pdgemm(+0x15153)[0x55aaaa484153]
2: [T480s:485190] [ 8] ./test.pdgemm(+0x532d7)[0x55aaaa4c22d7]
2: [T480s:485190] [ 9] ./test.pdgemm(+0x494b9)[0x55aaaa4b84b9]
2: [T480s:485190] [T480s:485187] *** Process received signal ***
2: [T480s:485189] *** Process received signal ***
2: [T480s:485189] Signal: Segmentation fault (11)
2: [T480s:485189] Signal code: Address not mapped (1)
2: [T480s:485189] Failing at address: 0xffffffff96ebf318
2: [T480s:485187] Signal: Segmentation fault (11)
2: [T480s:485187] Signal code: Address not mapped (1)
2: [T480s:485187] Failing at address: 0x69f95bc8
2: [T480s:485187] [ 0] /usr/lib/[0x7fe8df4050f0]
2: [T480s:485187] [ 1] [T480s:485189] /home/teonnik/code/spack/opt/spack/linux-arch-skylake/gcc-10.2.0/openmpi-4.0.5-hb4m5wzvl2chgf47y6vsuxwgbqtnzflj/lib/[0x7fe8e65acd17]
2: [T480s:485187] [10] ./test.pdgemm(+0x49785)[0x55aaaa4b8785]
2: [T480s:485190] [11] ./test.pdgemm(+0x49915)[0x55aaaa4b8915]
2: [T480s:485190] [12] ./test.pdgemm(+0x4a033)[0x55aaaa4b9033]
2: [T480s:485190] [13] [ 2] ./test.pdgemm(+0x53857)[0x55aaaa4c2857]
2: [T480s:485190] [14] ./test.pdgemm(+0x4a174)[0x55aaaa4b9174]
2: [T480s:485190] [15] ./test.pdgemm(+0xf1a3)[0x55aaaa47e1a3]
2: [T480s:485190] [16] /home/teonnik/code/spack/opt/spack/linux-arch-skylake/gcc-10.2.0/intel-mkl-2020.4.304-vengwtbz3klxn4tmjegweaggvcgo6qut/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/[0x7fe8df1d643a]
2: [T480s:485187] [ 3] [ 0] /home/teonnik/code/spack/opt/spack/linux-arch-skylake/gcc-10.2.0/intel-mkl-2020.4.304-vengwtbz3klxn4tmjegweaggvcgo6qut/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/[0x7fe8e6cf69bc]
2: [T480s:485187] [ 4] /usr/lib/[0x7f36e61e10f0]
2: /home/teonnik/code/spack/opt/spack/linux-arch-skylake/gcc-10.2.0/intel-mkl-2020.4.304-vengwtbz3klxn4tmjegweaggvcgo6qut/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/[0x7fe8e6d5bd27]
2: [T480s:485187] [ 5] ./test.pdgemm(+0x2153d)[0x558b68caa53d]
2: [T480s:485187] [ 6] ./test.pdgemm(+0x21f5c)[0x558b68caaf5c]
2: [T480s:485187] [ 7] ./test.pdgemm(+0x15153)[0x558b68c9e153]
2: [T480s:485187] [ 8] ./test.pdgemm(+0x532d7)[0x558b68cdc2d7]
2: [T480s:485187] [ 9] ./test.pdgemm(+0x494b9)[0x558b68cd24b9]
2: [T480s:485187] [10] ./test.pdgemm(+0x49785)[0x558b68cd2785]
2: [T480s:485187] [11] ./test.pdgemm(+0x49915)[0x558b68cd2915]
2: [T480s:485187] [12] ./test.pdgemm(+0x4a033)[0x558b68cd3033]
2: [T480s:485187] [13] ./test.pdgemm(+0x53857)[0x558b68cdc857]
2: [T480s:485187] [14] ./test.pdgemm(+0x4a174)[0x558b68cd3174]
2: [T480s:485187] [15] ./test.pdgemm(+0xf1a3)[0x558b68c981a3]
2: [T480s:485187] [16] /usr/lib/[0x7fe8decce152]
2: [T480s:485187] [17] ./test.pdgemm(+0x1010e)[0x558b68c9910e]
2: [T480s:485187] *** End of error message ***
2: [T480s:485189] [ 1] /home/teonnik/code/spack/opt/spack/linux-arch-skylake/gcc-10.2.0/openmpi-4.0.5-hb4m5wzvl2chgf47y6vsuxwgbqtnzflj/lib/[0x7f36ed388d17]
2: [T480s:485189] [ 2] /home/teonnik/code/spack/opt/spack/linux-arch-skylake/gcc-10.2.0/intel-mkl-2020.4.304-vengwtbz3klxn4tmjegweaggvcgo6qut/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/[0x7f36e5fb243a]
2: [T480s:485189] [ 3] /home/teonnik/code/spack/opt/spack/linux-arch-skylake/gcc-10.2.0/intel-mkl-2020.4.304-vengwtbz3klxn4tmjegweaggvcgo6qut/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/[0x7f36edad29bc]
2: [T480s:485189] [ 4] /usr/lib/[0x7ffba7f5e152]
2: /home/teonnik/code/spack/opt/spack/linux-arch-skylake/gcc-10.2.0/intel-mkl-2020.4.304-vengwtbz3klxn4tmjegweaggvcgo6qut/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/[0x7f36edb37d27]
2: [T480s:485190] [17] ./test.pdgemm(+0x1010e)[0x55aaaa47f10e]
2: [T480s:485190] *** End of error message ***
2: [T480s:485189] [ 5] ./test.pdgemm(+0x2153d)[0x559c951be53d]
2: [T480s:485189] [ 6] ./test.pdgemm(+0x21f5c)[0x559c951bef5c]
2: [T480s:485189] [ 7] ./test.pdgemm(+0x15153)[0x559c951b2153]
2: [T480s:485189] [ 8] ./test.pdgemm(+0x532d7)[0x559c951f02d7]
2: [T480s:485189] [ 9] ./test.pdgemm(+0x494b9)[0x559c951e64b9]
2: [T480s:485189] [10] ./test.pdgemm(+0x49785)[0x559c951e6785]
2: [T480s:485189] [11] ./test.pdgemm(+0x49915)[0x559c951e6915]
2: [T480s:485189] [12] ./test.pdgemm(+0x4a033)[0x559c951e7033]
2: [T480s:485189] [13] ./test.pdgemm(+0x53857)[0x559c951f0857]
2: [T480s:485189] [14] ./test.pdgemm(+0x4a174)[0x559c951e7174]
2: [T480s:485189] [15] ./test.pdgemm(+0xf1a3)[0x559c951ac1a3]
2: [T480s:485189] [16] /usr/lib/[0x7f36e5aaa152]
2: [T480s:485189] [17] ./test.pdgemm(+0x1010e)[0x559c951ad10e]
2: [T480s:485189] *** End of error message ***
2: --------------------------------------------------------------------------
2: Primary job  terminated normally, but 1 process returned
2: a non-zero exit code. Per user-direction, the job has been aborted.
2: --------------------------------------------------------------------------
2: --------------------------------------------------------------------------
2: mpiexec noticed that process rank 3 with PID 0 on node T480s exited on signal 11 (Segmentation fault).
2: --------------------------------------------------------------------------
1/1 Test #2: test.pdgemm ......................***Failed    4.26 sec

Wrong results when beta>0 in a specific case

The pdgemm wrapper gives wrong results when pdgemm is invoked with the following parameters:

// ****************************************
// *       INPUT PARAMETERS BEGIN         *
// ****************************************
// *  global dimensions  *
// ***********************
// matrix A
int ma = 1280; // rows
int na = 1280; // cols

// matrix B
int mb = 1280; // rows
int nb = 1280; // cols

// matrix C
int mc = 1280; // rows
int nc = 1280; // cols

// ***********************
// *     block sizes     *
// ***********************
// matrix A
int bma = 32; // rows
int bna = 32; // cols

// matrix B
int bmb = 32; // rows
int bnb = 32; // cols

// matrix C
int bmc = 32; // rows
int bnc = 32; // cols

// ***********************
// *   submatrices ij    *
// ***********************
// matrix A
int ia = 1; // rows
int ja = 545; // cols

// matrix B
int ib = 513; // rows
int jb = 545; // cols

// matrix C
int ic = 1; // rows
int jc = 513; // cols

// ***********************
// *    problem size     *
// ***********************
int m = 512;
int n = 32;
int k = 736;

// ***********************
// *   transpose flags   *
// ***********************
char trans_a = 'N';
char trans_b = 'T';

// ***********************
// *    scaling flags    *
// ***********************
double alpha = 1.0;
double beta = 1.0;

// ***********************
// *    leading dims     *
// ***********************
int lld_a = 640;
int lld_b = 640;
int lld_c = 640;

// ***********************
// *      proc grid      *
// ***********************
int p = 2; // rows
int q = 4; // cols

// ***********************
// *      proc srcs      *
// ***********************
// matrix A
int src_ma = 0; // rows
int src_na = 0; // cols

// matrix B
int src_mb = 0; // rows
int src_nb = 0; // cols

// matrix C
int src_mc = 0; // rows
int src_nc = 0; // cols

// ****************************************
// *         INPUT PARAMETERS END         *
// ****************************************

test.scalar_matmul never ends


test.scalar_matmul is probably broken: it runs endlessly with any BLAS flavor (OPENBLAS, MKL, or CUDA). The CMake options are:

  cmake ../cosma \
          -DCMAKE_INSTALL_PREFIX=/usr \

Compilers and libraries: GCC 10.2.0, OpenMPI, CUDA 11.1.105, OpenBLAS 0.3.12, MKL 2020.4.304, ScaLAPACK 2.1.0

unexpected performance when using COSMA with CPU

I'm seeing the following weak scaling performance when using COSMA configured with OpenBLAS:

Num Nodes | Avg Exec Time (ms)
--------- | ------------------
1         | 2867
2         | 3026
4         | 3208
8         | 3090
16        | 5659
32        | 5730
64        | 5741
128       | 5893
256       | 5889

On each node, I'm using 20 threads for OpenMP. The initial problem size / command line is env OMP_NUM_THREADS=20 COSMA_OVERLAP_COMM_AND_COMP=ON jsrun -b none -c ALL_CPUS -g ALL_GPUS -r 1 -n 1 /g/g15/yadav2/cosma/build/miniapp/cosma_miniapp -r 10 -m 8192 -n 8192 -k 8192. I'm on commit c7bdab9.

I'm not sure what happened at 16 nodes that caused the performance dip -- is something like this expected?
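
To put a number on the dip, weak-scaling efficiency can be computed from the timings above as E(n) = T(1) / T(n). The following is a minimal sketch; the function name is an assumption made for illustration and is not part of COSMA.

```cpp
#include <cassert>
#include <vector>

// Weak-scaling efficiency relative to the first (single-node) run:
// E(n) = T(1) / T(n). With perfect weak scaling the time stays constant
// and every entry equals 1.
std::vector<double> weak_scaling_efficiency(const std::vector<double>& times_ms) {
    assert(!times_ms.empty());
    std::vector<double> efficiency;
    efficiency.reserve(times_ms.size());
    for (double t : times_ms) {
        efficiency.push_back(times_ms.front() / t);
    }
    return efficiency;
}
```

Applied to the table, the efficiency is about 0.93 at 8 nodes (2867 / 3090) and drops to about 0.51 at 16 nodes (2867 / 5659), after which it stays roughly flat.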

Switching to a proper memory-pool implementation

The current implementation allocates one large piece of memory from which all the buffers are taken. This has its advantages, but it requires that all the memory be available as one contiguous piece, which might not be possible if memory is fragmented (see here).

We want to switch to using a proper memory pool.
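
One way to relax the contiguity requirement is a pool made of several independently allocated chunks. The sketch below only illustrates that idea under simple assumptions (double elements, no deallocation, no alignment); the class `chunked_pool` and its methods are hypothetical and are not COSMA's memory pool.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical chunked pool: a request that does not fit into the current
// chunk triggers a new, independent allocation, so no single contiguous
// block has to cover the total demand.
class chunked_pool {
public:
    explicit chunked_pool(std::size_t chunk_elems) : chunk_elems_(chunk_elems) {
        assert(chunk_elems > 0);
    }

    // Returns a pointer to `n` contiguous doubles owned by the pool.
    double* get_buffer(std::size_t n) {
        if (chunks_.empty() || used_ + n > chunks_.back().size()) {
            // Oversized requests get a chunk of their own.
            chunks_.emplace_back(n > chunk_elems_ ? n : chunk_elems_);
            used_ = 0;
        }
        double* ptr = chunks_.back().data() + used_;
        used_ += n;
        return ptr;
    }

    std::size_t num_chunks() const { return chunks_.size(); }

private:
    std::size_t chunk_elems_;
    std::size_t used_ = 0;                     // elements used in the last chunk
    std::vector<std::vector<double>> chunks_;  // each chunk is a separate allocation
};
```

Buffers handed out earlier stay valid when a new chunk is added, because growing the outer vector moves the inner vectors without reallocating their storage. The trade-off is that a single buffer can no longer span chunks.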

Fixing CI/CD issues

Hey @haampie!

It seems CI/CD complains about a missing OpenMP flag. COSMA itself doesn't use OpenMP, but one of its dependencies (COSTA) depends on OpenMP.

This part hasn't been changed since the last time it worked, so I don't understand what the issue is. Locally and on daint it works for me without problems. Can you please have a look?

test.multiply failed when built with CUDA support

A system call failed during shared memory initialization that should
not have.  It is likely that your MPI job will now either abort or
experience performance degradation.

  Local host:  comp
  System call: unlink(2) /dev/shm/osc_rdma.comp.59100001.9
  Error:       No such file or directory (errno 2)
[comp:35304] Read -1, expected 2048, errno = 14
[comp:35304] Read -1, expected 2048, errno = 14
[comp:35304] *** An error occurred in MPI_Accumulate
[comp:35304] *** reported by process [1494220801,3]
[comp:35304] *** on win rdma window 9
[comp:35304] *** MPI_ERR_OTHER: known error not in list
[comp:35304] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[comp:35304] ***    and potentially your MPI job)
<end of output>
Test time =   4.24 sec
Test Failed.
"test.multiply" end time: Jun 24 19:41 MSK
"test.multiply" time elapsed: 00:00:04


Clang emits a lot of warnings


[2/58] Building CXX object src/cosma/CMakeFiles/cosma_pxgemm_cpp.dir/scalapack.cpp.o
In file included from /home/teonnik/code/cosma/src/cosma/scalapack.cpp:1:
In file included from /home/teonnik/code/cosma/src/cosma/scalapack.hpp:5:
/home/teonnik/code/cosma/libs/grid2grid/src/grid2grid/scalapack_layout.hpp:158:5: warning: explicitly defaulted default constructor is implicitly deleted [-Wdefaulted-function-deleted]
    local_grid_coord() = default;
/home/teonnik/code/cosma/libs/grid2grid/src/grid2grid/scalapack_layout.hpp:155:21: note: default constructor of 'local_grid_coord' is implicitly deleted because field 'el_coord' has no default constructor
    elem_grid_coord el_coord;
/home/teonnik/code/cosma/libs/grid2grid/src/grid2grid/scalapack_layout.hpp:192:5: warning: explicitly defaulted default constructor is implicitly deleted [-Wdefaulted-function-deleted]
    matrix_grid() = default;
/home/teonnik/code/cosma/libs/grid2grid/src/grid2grid/scalapack_layout.hpp:189:16: note: default constructor of 'matrix_grid' is implicitly deleted because field 'matrix_dimension' has no default constructor
    matrix_dim matrix_dimension;
/home/teonnik/code/cosma/libs/grid2grid/src/grid2grid/scalapack_layout.hpp:213:5: warning: explicitly defaulted default constructor is implicitly deleted [-Wdefaulted-function-deleted]
    local_blocks() = default;
/home/teonnik/code/cosma/libs/grid2grid/src/grid2grid/scalapack_layout.hpp:207:15: note: default constructor of 'local_blocks' is implicitly deleted because field 'block_dimension' has no default constructor
    block_dim block_dimension;
/home/teonnik/code/cosma/libs/grid2grid/src/grid2grid/scalapack_layout.hpp:244:5: warning: explicitly defaulted default constructor is implicitly deleted [-Wdefaulted-function-deleted]
    data_layout() = default;
/home/teonnik/code/cosma/libs/grid2grid/src/grid2grid/scalapack_layout.hpp:239:16: note: default constructor of 'data_layout' is implicitly deleted because field 'matrix_dimension' has no default constructor
    matrix_dim matrix_dimension;
4 warnings generated.

Full report:
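
The warnings above all follow one pattern: a struct declares its default constructor as `= default`, but one of its members has no default constructor, so the defaulted constructor ends up deleted. The sketch below reproduces the pattern with simplified stand-ins for the types named in the log (the member lists are assumptions) and shows one possible fix. Compiling it with Clang is expected to trigger the same warning for the "before" struct.

```cpp
#include <type_traits>

// Simplified stand-in: a coordinate type without a default constructor.
struct elem_grid_coord {
    int row;
    int col;
    elem_grid_coord(int r, int c) : row(r), col(c) {}
};

// Same pattern as in the warning: `= default` is accepted, but the
// constructor is implicitly deleted because `el_coord` cannot be
// default-constructed.
struct local_grid_coord_before {
    elem_grid_coord el_coord;
    local_grid_coord_before() = default;
};

// One possible fix: give the member a default member initializer.
struct local_grid_coord_after {
    elem_grid_coord el_coord{0, 0};
    local_grid_coord_after() = default;
};

static_assert(!std::is_default_constructible<local_grid_coord_before>::value,
              "defaulted constructor is deleted");
static_assert(std::is_default_constructible<local_grid_coord_after>::value,
              "defaulted constructor is usable");
```

Alternatively, the `= default` line can simply be removed when the type is never meant to be default-constructed.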

Add a memory limit variable

COSMA is able to deal with limited memory, but the available memory per rank should be specified by the user. If not specified, infinite memory is assumed. We want to add an environment variable that will specify the amount of available memory.
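
A minimal sketch of how such a variable could be read is shown below. The function name and the choice of "number of elements" as the unit are assumptions for illustration; the fallback value mirrors the "infinite memory" default described above.

```cpp
#include <cassert>
#include <cstdlib>
#include <limits>

// Reads a per-rank memory limit (in number of elements) from the given
// environment variable. Unset, empty, non-numeric, or non-positive values
// fall back to "unlimited".
long long memory_limit_from_env(const char* var_name) {
    assert(var_name != nullptr);
    const long long unlimited = std::numeric_limits<long long>::max();
    const char* value = std::getenv(var_name);
    if (value == nullptr || *value == '\0') {
        return unlimited;
    }
    char* end = nullptr;
    const long long parsed = std::strtoll(value, &end, 10);
    if (*end != '\0' || parsed <= 0) {
        return unlimited;
    }
    return parsed;
}
```

The variable name is left to the caller here, so the sketch does not commit to any particular naming scheme.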

libgrid2grid library installed in the wrong directory

I am trying to build and install COSMA on Piz Daint:
source scripts/

make -j 8
make install

All library files are copied to the location <installation dir>/cosma/lib64 but libgrid2grid.a is copied to <installation dir>/cosma/lib.

cosma conda package on conda-forge?

For a related discussion see electronic-structure/SIRIUS#621

Over the past few weeks we've been adding conda packages for spla, spfft and sirius and I'm now in the process of integrating them into the cp2k one.

COSMA may be somewhat of a fringe case for conda since conda binaries are unlikely to be run on really large-scale machines (for maximum performance you would compile it yourself) but it could still be useful for reproducibility and quick testing.

One could start with an MPI-only build, which I think would be straightforward; it is also possible to create GPU-enabled conda packages, but there is little documentation, so this would probably be better left to a later stage.

Let me know what you think!

out-of-memory events

I have OOM events when working with quite large matrix dimensions.

I am working with commit a7c6bb3
(on Intel Xeon Platinum 8360Y with 72 cores per node and 256 GB RAM)
using the following libraries: intel-mpi, intel-mkl, and intel-mpi (not too old versions).

When working on 2 of my 72 core machines and executing the following command:
srun /u/airmler/src/COSMA/miniapp/cosma_miniapp -m 36044 -n 36044 -k 36044 -r 5
(which is identical to mpirun -np 144 $EXE, when working with another wrapper),
I get an OOM event (Slurm kills the job because some process is OOM).

However, it reports in stdout:
Required memory per rank (in #elements): 171415254
which is around 1.3GB. This is way less than the 3.7GB per rank I should have available.
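
The conversion behind this estimate can be sketched as follows, assuming 8-byte double-precision elements (an assumption about the element type used by the miniapp run). The helper name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Converts an element count into gibibytes for a given element size in bytes.
double elements_to_gib(long long num_elements, std::size_t bytes_per_element) {
    assert(num_elements >= 0);
    return static_cast<double>(num_elements) *
           static_cast<double>(bytes_per_element) /
           (1024.0 * 1024.0 * 1024.0);
}
```

For 171415254 elements of 8 bytes each this gives 1371322032 bytes, which is about 1.28 GiB.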

What helps is when I reduce the used memory:
is running without problems.

On this cluster, it is not so easy to profile the memory consumption. Are you aware of any load imbalances? Is the reported "required memory per rank" reliable?

Note: it is not important for me that this issue is resolved. I simply want to share this behavior with you.
I am sure you can adapt the matrix dimensions so that you can reproduce the problem on your cluster.
Otherwise I could try to run some suggested examples on "my" machine.

Best regards.

cosma_miniapp fails if -s option is not specified.

Even if the -s option is marked as optional in the doc, cosma_miniapp terminates with a domain_error exception if this option is not given a value.

Please add a default value or check if the value of the option is available before accessing it.
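
The second suggestion, checking whether a value is present before accessing it, could look roughly like the sketch below. It uses a plain `std::map` as a stand-in for the parsed command line; the miniapp's real option parser is not shown on this page, so the function and its arguments are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <string>

// Returns the value stored for `name`, or `fallback` when the option was
// not given on the command line.
std::string option_or_default(const std::map<std::string, std::string>& parsed,
                              const std::string& name,
                              const std::string& fallback) {
    assert(!name.empty());
    const auto it = parsed.find(name);
    return it == parsed.end() ? fallback : it->second;
}
```

With this pattern a missing optional flag yields the fallback instead of an exception.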

Compilation fails with CUDA 9

When compiling COSMA with CUDA 9, using the following instructions:

make -j

The compilation fails with:

/COSMA/benchmarks/gpu_gemm_cublas.cpp:5:10: fatal error: cublasLt.h: No such file or directory
 #include <cublasLt.h>
compilation terminated.

The header cublasLt.h is indeed not included in CUDA 9.
Compiling with CUDA 10, however, works fine.
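
One possible way to keep such a benchmark from breaking the build is to detect the header at compile time. The sketch below relies on `__has_include` (standard since C++17 and offered as an extension by many older compilers); the macro `EXAMPLE_HAVE_CUBLASLT` and the helper function are hypothetical names, not something COSMA defines.

```cpp
// Detect cublasLt.h at compile time and record the result in a macro.
#if defined(__has_include)
#  if __has_include(<cublasLt.h>)
#    include <cublasLt.h>
#    define EXAMPLE_HAVE_CUBLASLT 1
#  endif
#endif
#ifndef EXAMPLE_HAVE_CUBLASLT
#  define EXAMPLE_HAVE_CUBLASLT 0
#endif

// True when code that depends on cublasLt can be compiled.
constexpr bool cublaslt_available() { return EXAMPLE_HAVE_CUBLASLT == 1; }
```

Code that needs cublasLt can then be wrapped in `#if EXAMPLE_HAVE_CUBLASLT`. Excluding the benchmark at the build-system level for older CUDA versions would be an alternative.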

COSMA cublas crash after job finished

I realized I cannot reopen issue #87, so I am copying the question here. Sorry for the confusion.


Hi @kabicm ,
I was able to rerun the test, and the problem still exists with v2.5.0 and the master branch. Does COSMA use a wrapper over MPI_Finalize()? I noticed a similar issue on Summit with another code that wraps MPI_Finalize: LLNL/Caliper#392.

If that's the case, my question becomes: Is it possible to manually finalize COSMA?

A related question: after I call COSMA, does it keep holding GPU memory once the gemm calls have returned? Is it possible to control that GPU memory? I guess I am looking for something like initializing and finalizing a COSMA environment around a certain code region, so that the GPU memory is freed outside that region.

Best wishes,

Add support for hipBLAS

It would be nice to have support for hipBLAS, as this would allow testing the HIP code path on an NVIDIA device.

In return the support for rocBLAS could be dropped.

cannot find lgrid2grid

Dear developers, I am trying to link against COSMA following the "USE COSMA in 30 seconds" instructions. However, the linker reports that grid2grid is not found. I saw that you replaced grid2grid with COSTA. Should the libraries to link against be updated as well? If so, which library is the correct one to link? I saw libcosta.a, libcosta_prefixed_scalapack.a, and libcosta_scalapack.a.

Thank you so much.

Multiply using only 1 process

The current code assumes that at least 2 MPI processes are used. If only 1 process is used, there are no divisions, so the strategy is empty, which causes a segfault when the matrices are created. Specifically, when the matrices are created, the buffers are created as well, and the segfault occurs in compute_buff_sizes at the line n_buckets_[step] == 1. n_buckets_ has a size equal to strategy_->size, which is zero in this case, so the access is out of bounds.
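
The failing access can be illustrated with a bounds-checked variant of the comparison. This is only a sketch of the idea: the free function below is hypothetical, and returning `true` for a missing step (nothing to split) is an assumption about the desired behavior, not COSMA's actual fix.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// `n_buckets` holds one entry per strategy step, so it is empty when the
// strategy has no divisions (single process).
bool step_has_single_bucket(const std::vector<int>& n_buckets, std::size_t step) {
    if (step >= n_buckets.size()) {
        // No such step, e.g. an empty strategy: assume there is nothing to split.
        return true;
    }
    assert(n_buckets[step] > 0);  // assumed: every existing step has at least one bucket
    return n_buckets[step] == 1;
}
```

With an empty vector, the unchecked expression `n_buckets[step]` reads past the end of the storage, which matches the segfault described above.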

Multi GPU support

At the moment, it is assumed that each rank uses a single GPU. However, we should also be able to deal with multiple GPUs being available to the same MPI rank.

Remove beta parameter from strategy

Currently, the beta parameter in strategy is of type double (and thus not templated). However, there is no need for strategy to contain this parameter. Strategy has to be decoupled from the command line parser (options), and beta should be removed from strategy.

Compiling with intel/17.0.4, impi/17.3 and mkl/17.3


I ran into a few minor errors while trying to compile with intel/17.0.4 and impi/17.3.
The first one is in grid2grid: kabicm/grid2grid/issues/10

The others are:

src/cosma/strategy.cpp(244): error: no instance of constructor "std::vector<_Tp, _Alloc>::vector [with _Tp=std::tuple<long long, int, char>, _Alloc=std::allocator<std::tuple<long long, int, char>>]" matches the argument list
            argument types are: ({...}, {...}, {...})
      std::vector<dim_pair> dims = {{m, divm, 'B'}, {n, divn, 'A'}, {k, divk, 'C'}};

src/cosma/strategy.cpp(278): error: copy-list-initialization cannot use a constructor marked "explicit"
      return {memory_A, memory_B, memory_C};
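
Both errors are consistent with a standard library in which the element-wise `std::tuple` constructor is declared `explicit`, so a braced list cannot be converted to a tuple implicitly. A possible workaround, sketched below, is to name the tuple type or to use `std::make_tuple`. The function names, the `long long` return types, and the positivity check are assumptions; only `dim_pair` and its element types are taken from the error message.

```cpp
#include <cassert>
#include <tuple>
#include <vector>

using dim_pair = std::tuple<long long, int, char>;

// Naming the element type makes each entry a direct-list-initialization,
// which is allowed to use an explicit constructor.
std::vector<dim_pair> make_dims(long long m, int divm,
                                long long n, int divn,
                                long long k, int divk) {
    assert(divm > 0 && divn > 0 && divk > 0);  // assumed: divisors are positive
    return {dim_pair{m, divm, 'B'}, dim_pair{n, divn, 'A'}, dim_pair{k, divk, 'C'}};
}

// std::make_tuple avoids copy-list-initialization in the return statement.
std::tuple<long long, long long, long long> make_memory(long long memory_A,
                                                        long long memory_B,
                                                        long long memory_C) {
    return std::make_tuple(memory_A, memory_B, memory_C);
}
```

Both forms also compile with newer toolchains, so they should not need to be guarded by compiler version.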


cmake --install . --prefix /my/custom/prefix fails with permission denied

on Alps/Eiger:

$ cmake --install . --prefix ../install.cosma
-- Install configuration: "Release"
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/blas.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/layout.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/memory_pool.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/random_generator.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/pinned_buffers.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/local_multiply.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/timer.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/multiply.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/cinterface.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/communicator.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/context.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/environment_variables.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/statistics.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/mpi_mapper.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/profiler.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/matrix.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/blacs.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/cosma_pxgemm.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/mapper.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/scalapack.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/one_sided_communicator.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/two_sided_communicator.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/strategy.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/buffer.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/math_utils.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/pxgemm_params.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/interval.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/lib64/cmake/cosma/cosmaConfig.cmake
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/lib64/cmake/cosma/cosmaConfigVersion.cmake
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/lib64/cmake/cosma/FindMKL.cmake
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/lib64/pkgconfig/cosma.pc
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/lib64/cmake/grid2grid/grid2gridTargets.cmake
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/lib64/cmake/grid2grid/grid2gridTargets-release.cmake
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/grid2grid
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/grid2grid/transform.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/grid2grid/communication_data.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/grid2grid/memory_utils.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/grid2grid/mpi_type_wrapper.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/grid2grid/profiler.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/grid2grid/ranks_reordering.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/grid2grid/cantor_mapping.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/grid2grid/grid_cover.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/grid2grid/transformer.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/grid2grid/grid2D.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/grid2grid/block.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/grid2grid/grid_layout.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/grid2grid/comm_volume.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/grid2grid/scalapack_layout.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/grid2grid/tiling_manager.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/grid2grid/interval.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/lib64/cmake/grid2grid/grid2gridConfig.cmake
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/lib64/cmake/grid2grid/grid2gridConfigVersion.cmake
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/lib64/libgrid2grid.a
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/lib64/libcosma.a
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/lib64/libcosma_pxgemm.a
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/lib64/libcosma_pxgemm_cpp.a
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/lib64/cmake/cosma/cosmaTargets.cmake
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/lib64/cmake/cosma/cosmaTargets-release.cmake
-- Installing: /usr/local/bin/test.mapper
CMake Error at tests/cmake_install.cmake:60 (file):
  file INSTALL cannot copy file
  "/users/timuel/work/cp2k/build.cosma/tests/test.mapper" to
  "/usr/local/bin/test.mapper": Permission denied.
Call Stack (most recent call first):
  cmake_install.cmake:67 (include)

COSMA cublas crash after job finished

Dear COSMA developers,

I have been working at a GPU hackathon, testing COSMA (commit e034ddd) through the ScaLAPACK API with the FHI-aims code on the Ascent cluster (the training cluster for Summit). With the COSMA GPU version, I was able to get a 7x speedup for pzgemm (matrix size 3312 * 3312), comparing 36 Power CPU cores + 6 V100 GPUs vs. 36 Power CPU cores (36 MPI ranks with OMP_THREADS=1). That is great. However, I have also seen some GPU errors after the job finished. The errors only occur if I link my code against COSMA. @kabicm suggests it could be something that happens during the cleanup stage. Unfortunately, I no longer have access to the cluster, so I am not able to provide a minimal example that reproduces the error. We will try to get access to the Summit cluster later.

This is the error I saw.

error: GPU API call : invalid argument
terminate called after throwing an instance of 'std::runtime_error'
  what():  GPU ERROR

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
error: GPU API call : invalid argument
terminate called after throwing an instance of 'std::runtime_error'
  what():  GPU ERROR

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
error: GPU API call : invalid argument
terminate called after throwing an instance of 'std::runtime_error'
  what():  GPU ERROR

Here are my build, link, and job submission scripts.

set -e
module purge
module load gcc/7.4.0
module load essl
module load cuda/10.1.243
module load spectrum-mpi
module load netlib-lapack
module load netlib-scalapack
module load cmake


export CC=mpicc
export CXX=mpicxx
cmake -DCOSMA_BLAS=CUDA -DCOSMA_SCALAPACK=CUSTOM -DCMAKE_INSTALL_PREFIX=/ccsopen/home/yaoyi92/proj_dir_aims-gw/yaoyi92/cosma/install_yy ..
make VERBOSE=0 -j 16
make install
set(CMAKE_C_FLAGS "-O2 -g -fbacktrace -mcpu=power9 -funwind-tables -fopenmp" CACHE STRING "")

set(CMAKE_Fortran_COMPILER "mpif90" CACHE STRING "")
set(CMAKE_Fortran_FLAGS "-O2 -g -fbacktrace -mcpu=power9 -ffree-line-length-none -funwind-tables -fopenmp" CACHE STRING "")
set(Fortran_MIN_FLAGS "-O0 -g -fbacktrace -ffree-line-length-none -funwind-tables -fopenmp" CACHE STRING "")

set(CMAKE_CUDA_FLAGS "-O2 -g -DAdd_ -arch=sm_70" CACHE STRING "")


SET(INC_PATHS "/ccsopen/home/yaoyi92/proj_dir_aims-gw/yaoyi92/cosma/install_yy/include" CACHE STRING "")

set(LIB_PATHS "/ccsopen/home/yaoyi92/proj_dir_aims-gw/yaoyi92/cosma/install_yy/lib64 $ENV{OLCF_ESSL_ROOT}/lib64 $ENV{OLCF_CUDA_ROOT}/lib64" CACHE STRING "")
set(LIBS "cosma_pxgemm cosma costa_scalapack costa Tiled-MM scalapack essl lapack cublas cudart" CACHE STRING "")

#BSUB -W 2:00
#BSUB -nnodes 1
#BSUB -alloc_flags gpumps
#BSUB -J aims-gw
#BSUB -o aims.%J
#BSUB -N [email protected]

module purge
module load gcc/7.4.0 spectrum-mpi/  cuda/10.1.243 essl/6.1.0-2 netlib-lapack/3.8.0 netlib-scalapack/2.0.2
module load nsight-systems/2021.2.1.58


#export LD_LIBRARY_PATH=/ccsopen/home/yaoyi92/proj_dir_aims-gw/yaoyi92/cosma/install_yy/lib64



ulimit -s unlimited

jsrun -n 6 -a 6 -c 6 -g 1 -r 6 $bin > aims.out

Best wishes,

COSMA miniapp on Summit

Hello! I am trying out COSMA on Summit. I built COSMA with -DCOSMA_BLAS=CUDA -DCOSMA_WITH_PROFILING=ON using GCC 8.1, CUDA 10.1, and IBM Spectrum MPI 10.3.

I am testing the miniapp as follows, using 3 nodes, 6 MPI ranks, 6 GPUs per node:

cosma/build/miniapp/cosma_miniapp -m 1000 -n 1000 -k 1000 -P 18

The result I get is

Strategy = Matrix dimensions (m, n, k) = (1000, 1000, 1000)
Number of processors: 1
Overlap of communication and computation: OFF.
Divisions strategy: 
Required memory per rank (in #elements): 166668
Available memory per rank (in #elements): 9223372036854775807

_p_ REGION                     CALLS      THREAD        WALL       %
_p_ total                          -       0.005       0.005   100.0
_p_   multiply                     -       0.004       0.004    86.8
_p_     computation                1       0.004       0.004    86.8
_p_     other                      2       0.000       0.000     0.0
_p_   preprocessing                -       0.001       0.001    13.2
_p_     communicators              1       0.001       0.001    12.8
_p_     matrices                   -       0.000       0.000     0.4
_p_       mapper                   -       0.000       0.000     0.3
_p_         coordinates            3       0.000       0.000     0.2
_p_         sizes                  3       0.000       0.000     0.1
_p_       layout                   3       0.000       0.000     0.0
_p_       buffer                   3       0.000       0.000     0.0
_p_     allocation                 2       0.000       0.000     0.0
COSMA TIMES [ms] = 5 

I am not sure why the number of processors is reported as 1 instead of 18.

segfault for CP2K RPA

I get a segfault in COSMA for CP2K RPA on a system of 256 water molecules. The backtrace is:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#1  0x5529c30 in _ZN5cosma6Mapper5ownerERNS_10Interval2DE
    at /scratch/snx3000/pseewald/rpa_benchmarks_cosma/COSMA/src/cosma/mapper.cpp:362
#2  0x5529c30 in _ZN5cosma6Mapper15get_layout_gridEv
    at /scratch/snx3000/pseewald/rpa_benchmarks_cosma/COSMA/src/cosma/mapper.cpp:401
#3  0x550b504 in _ZN5cosma6pxgemmIdEEvcciiiT_PKS1_iiPKiS3_iiS5_S1_PS1_iiS5_
    at /scratch/snx3000/pseewald/rpa_benchmarks_cosma/COSMA/src/cosma/cosma_pxgemm.cpp:190
#4  0x246aad2 in __cp_fm_basic_linalg_MOD_cp_fm_gemm
    at /scratch/snx3000/pseewald/rpa_benchmarks_cosma/cp2k/src/fm/cp_fm_basic_linalg.F:449
#5  0xf27025 in __cp_gemm_interface_MOD_cp_gemm
    at /scratch/snx3000/pseewald/rpa_benchmarks_cosma/cp2k/src/cp_gemm_interface.F:136
#6  0x1d59ede in contract_s_to_q
    at /scratch/snx3000/pseewald/rpa_benchmarks_cosma/cp2k/src/rpa_util.F:810

The calculation was run on Piz Daint gpu partition / 1024 nodes with 12 OMP threads per rank.

CP2K input file:

I used COSMA commit 374ddb4

Substitute grid2grid with COSTA

The library grid2grid is outdated and should be substituted with COSTA. COSTA is a highly efficient, communication-optimal algorithm for matrix redistribution, with the option to transpose or scale (multiply by a scalar) the matrix on the fly. Concretely, it implements the routine:
sub(B) = beta * sub(B) + alpha * sub(op(A)) ; op=N, T or C; sub = submatrix,
where the operation op corresponds to no operation (N), transpose (T) or conjugate (C).
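
To make the formula concrete, here is a serial reference sketch for real-valued, row-major matrices without submatrix offsets. It covers only op = N and op = T, ignores the distributed layout entirely, and the function name is hypothetical; it is not COSTA's interface.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Computes B = beta * B + alpha * op(A) in place.
// B is rows x cols; A is rows x cols for op == 'N' and cols x rows for op == 'T'.
void scale_and_add(std::vector<double>& B, const std::vector<double>& A,
                   std::size_t rows, std::size_t cols,
                   double alpha, double beta, char op) {
    assert(op == 'N' || op == 'T');
    assert(A.size() == rows * cols && B.size() == rows * cols);
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t j = 0; j < cols; ++j) {
            // Element (i, j) of op(A): A(i, j) for 'N', A(j, i) for 'T'.
            const double a = (op == 'N') ? A[i * cols + j] : A[j * rows + i];
            B[i * cols + j] = beta * B[i * cols + j] + alpha * a;
        }
    }
}
```

For example, with a 3 x 2 matrix A holding 1 to 6 row by row, B a 2 x 3 matrix of ones, alpha = 2, beta = 1 and op = 'T', the first row of B becomes 3, 7, 11.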

Segfault with tight memory usage

When the memory usage is tight (no alignment, pool growing factor=1), a segfault occurs in the following case:

            // matrix dimensions
            1280, 1280, // matrix A
            1280, 1280, // matrix B
            1280, 1280, // matrix C

            // block sizes
            32, 32, // matrix A
            32, 32, // matrix B
            32, 32, // matrix C

            // submatrices ij
            1, 545, // matrix A
            513, 545, // matrix B
            1, 513, // matrix C

            // problem size
            512, 32, 736,

            // transpose flags
            'N', 'T',

            // scaling flags
            1.0, 1.0,

            // leading dims
            640, 640, 640,

            // proc grid
            2, 4, 'R',

            // proc srcs
            0, 0, // matrix A
            0, 0, // matrix B
            0, 0  // matrix C

The segfault occurs only when beta != 0. It is caused by swapping reduce_buffer with buffer_i when reduce_buffer.size() < buffer_i.size().
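
To make the failure mode concrete, here is a minimal C++ sketch of a size-guarded swap. The names `reduce_buffer` and `buffer_i` come from the report, but `guarded_swap` and the use of plain `std::vector<double>` are assumptions for illustration only; COSMA's buffers come from a memory pool, so the actual fix may look quite different.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// After a plain swap, code that still treats buffer_i as holding its original
// number of elements would run past the end of the smaller allocation.
// This hypothetical helper grows both buffers to the larger size first,
// so either one can safely take over the other's role.
void guarded_swap(std::vector<double>& reduce_buffer, std::vector<double>& buffer_i) {
    const std::size_t needed = std::max(reduce_buffer.size(), buffer_i.size());
    reduce_buffer.resize(needed);  // new elements are zero-initialized
    buffer_i.resize(needed);
    std::swap(reduce_buffer, buffer_i);
    assert(reduce_buffer.size() == buffer_i.size());
}
```

Growing the smaller buffer costs extra memory, which matters in exactly the tight-memory configuration described here; an alternative would be to skip the swap and copy instead when the sizes differ.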

Test case 29 of test.pdgemm often (not always) segfaults

Built with clang, MKL + MPICH:

cmake ${src_dir} -G Ninja \
  -D CMAKE_INSTALL_PREFIX="~/software/cosma" \
  -D CMAKE_BUILD_TYPE=RelWithDebInfo \

Dependencies were built with spack (MKL/MPICH):

w6chlh7     [email protected]~ilp64+shared threads=none
e3ni7ed     [email protected]~argobots~fortran+hwloc+hydra+libxml2+pci+romio~slurm~verbs+wrapperrpath device=ch3 netmod=tcp patches=eb982de3366d48cbc55eb5e0df43373a45d9f51df208abf0835a72dc6c0b4774 pmi=pmi
h5qfs34         [email protected]~cairo~cuda~gl~libudev+libxml2~netloc~nvml+pci+shared
gt474jm             [email protected]
ptvhtfi             [email protected]~python

The test case:

mpiexec -n 16 tests/test.pdgemm --gtest_filter=Default/PdgemmTestWithParams.pdgemm/29

Error message:

=   PID 186245 RUNNING AT T480s
=   EXIT CODE: 139

Fix for the CI test.yml script

Issue #66 revealed that test.scalar_matmul was not run in CI.

After looking at ci/test.yml, it seems that the scalar_matmul and multiply_using_layout tests both execute the test.mapper script. The other parameters seem fine.

Can you verify this, @haampie?

unexpected performance when using COSMA with GPU (single node)

I'm testing COSMA with GPU support on a single node with multiple GPUs, and I'm not seeing the performance I would expect.

1 GPU:
COSMA TIMES [ms] = 1562 1657 2133 2390 6865
2 GPU:
COSMA TIMES [ms] = 1544 2710 3374 3626 6060
4 GPU:
COSMA TIMES [ms] = 805 832 1456 3142 6419

I expected:

  • A noticeable difference in runtime when going from 1 to 2 GPUs.
  • Reasonably stable timings. The difference between the min and max is quite high.

I'm on the current master and run the miniapp with the following command (-n and -r set how many ranks run on a node):

OMP_NUM_THREADS=6 COSMA_OVERLAP_COMM_AND_COMP=ON jsrun -n 4 -c 6 -g 1 -b none -r 4 ./miniapp/cosma_miniapp -m 16384 -n 16384 -k 16384 -r 5
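
To put the reported timings in perspective, the C++ sketch below computes the speedup between best times, the max/min spread, and the aggregate throughput. It assumes that every run used m = n = k = 16384 as in the command above, that each listed time covers one full multiplication, and that a multiplication costs 2·m·n·k floating-point operations. The function names are hypothetical helpers and not part of the miniapp.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Floating-point operations of one m x n x k multiplication (2*m*n*k).
double gemm_flops(double m, double n, double k) { return 2.0 * m * n * k; }

// Aggregate throughput in TFLOP/s for a run that took `ms` milliseconds.
double tflops(double flops, double ms) { return flops / (ms * 1e-3) / 1e12; }

// Ratio of the slowest to the fastest repetition.
double spread(const std::vector<double>& times_ms) {
    assert(!times_ms.empty());
    const auto mm = std::minmax_element(times_ms.begin(), times_ms.end());
    return *mm.second / *mm.first;
}

// Speedup of `other` over `base`, comparing the best time of each.
double speedup(const std::vector<double>& base_ms, const std::vector<double>& other_ms) {
    assert(!base_ms.empty() && !other_ms.empty());
    return *std::min_element(base_ms.begin(), base_ms.end()) /
           *std::min_element(other_ms.begin(), other_ms.end());
}
```

Under these assumptions, the best times give a speedup of about 1.01 from 1 to 2 GPUs and about 1.94 from 1 to 4 GPUs, the slowest repetition is between roughly 3.9 and 8 times slower than the fastest, and the best 1-GPU and 4-GPU runs correspond to about 5.6 and 10.9 TFLOP/s in aggregate.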

I build cosma with:

