cp2k / dbcsr

DBCSR: Distributed Block Compressed Sparse Row matrix library

Home Page: https://cp2k.github.io/dbcsr/

License: GNU General Public License v2.0

Makefile 0.49% C++ 4.20% Fortran 74.18% C 8.34% Shell 1.89% Python 7.42% Gnuplot 0.09% CMake 2.60% Jupyter Notebook 0.53% Roff 0.15% Groovy 0.12%
cp2k blas matrix-multiplication gemm cuda sparse-matrix openmp-parallelization mpi hpc linear-algebra

dbcsr's Issues

setup bug-reporting templates

The templates should include examples of how to obtain the information we need to reproduce a problem:

  • MPI implementation and version
  • compiler vendor, version, flags
  • OS/platform
  • BLAS/LAPACK implementation and versions
  • CUDA compiler version and target platform
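A template could point reporters to a small script that gathers most of these items in one go. The following is only a sketch: the function names `report` and `collect_env` are made up for illustration, the `FC` and `CC` defaults are assumptions, and the version flags are the ones commonly accepted by Open MPI, GCC and nvcc (other toolchains may need different ones). BLAS/LAPACK versions and compiler flags are not covered, since there is no single command that reports them.

```shell
#!/bin/sh
# report LABEL COMMAND [ARGS...]
# Print the command's version output (indented) under LABEL,
# or a "not found" line if the command is not on PATH.
report() {
    label=$1
    shift
    if command -v "$1" >/dev/null 2>&1; then
        printf '%s:\n' "$label"
        "$@" 2>&1 | sed 's/^/    /'
    else
        printf '%s: not found (%s)\n' "$label" "$1"
    fi
}

# Hypothetical selection of tools; FC and CC are assumed to hold a
# single command name (no extra flags).
collect_env() {
    report "OS/platform"      uname -srm
    report "MPI launcher"     mpirun --version
    report "Fortran compiler" "${FC:-gfortran}" --version
    report "C compiler"       "${CC:-gcc}" --version
    report "CUDA compiler"    nvcc --version
}

collect_env
```

The output can be pasted into a bug report as-is; missing tools show up explicitly instead of being silently skipped.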

Use a centralized DEBUG flag

Drop all per-module flags such as

LOGICAL, PARAMETER :: debug_mod = .FALSE.
LOGICAL, PARAMETER :: careful_mod = .FALSE.

....

IF (debug_mod) THEN
      ...
 ENDIF

IF (careful_mod) THEN
      ...
 ENDIF


CMake: install target missing

I am sure you are aware of this, but the CMake build lacks an install target.

Probably, this should be (in case of a general out-of-source build):

  • <CMAKE_BUILD_DIR>/src/*.mod -> prefix/include/*
  • <CMAKE_BUILD_DIR>/src/*.a|so|dylib -> prefix/lib/*
  • <CMAKE_SOURCE_DIR>/src/*.h -> prefix/include/*

[cmake] introduce blacklist facilities

We know that certain OpenMPI, compiler, or BLAS/LAPACK versions are broken.

To reduce the workload, it might be beneficial to maintain a list of broken dependencies in CMake.

'dbcsr_trace_ab' is only correct for a symmetric matrix 'b'

This is a rather minor issue, since qc-developers will usually use this with real symmetric matrices,
e.g., Tr[PF] or Tr[PS], which will work just fine. To be formally correct, one could either rename this
function to 'dot_ab' or actually transpose matrix b.
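The distinction can be written out explicitly. Assuming, as the issue title implies, that the routine accumulates the element-wise product of the two matrices, it computes the second quantity below rather than the first:

```latex
\begin{align}
\operatorname{Tr}[AB] &= \sum_i (AB)_{ii} = \sum_{i,j} a_{ij}\, b_{ji}, \\
\langle A, B \rangle &= \sum_{i,j} a_{ij}\, b_{ij} = \operatorname{Tr}[A B^{\mathsf T}].
\end{align}
```

The two expressions coincide whenever $b_{ij} = b_{ji}$, which is why symmetric matrices such as those in Tr[PF] or Tr[PS] give the expected result.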

[cmake] installation should contain FooConfig.cmake

The code snippet in #55 (comment) is not the best way of finding an external dbcsr library from a downstream project.

A downstream project should call find_package() in "config" mode; see Stack Overflow for discussion and an example.

One could do it with

CONFIGURE_FILE(
  ${CMAKE_SOURCE_DIR}/cmake/FooConfig.cmake.in
  ${CMAKE_BINARY_DIR}/FooConfig.cmake
  @ONLY
)

followed by

install (FILES
         "${CMAKE_CURRENT_BINARY_DIR}/FooConfig.cmake"
         DESTINATION "lib/cmake/Foo")

where FooConfig.cmake.in would contain things like

SET(Foo_INCLUDE_DIRS "...")
SET(Foo_LIBRARIES "...")

p.s. There should probably be components as well for this case, namely c and fortran, but that I have no idea about.

[libcusmm] libcusmm_unittest_transpose test failure

On our K20X system, the libcusmm_unittest_transpose test fails with the following message:

13/14 Test #13: libcusmm_unittest_transpose ...........................***Failed    2.43 sec
# Libcusmm has 295 blocksizes for transposition
Cannot transpose matrices with dimensions above 80, got (256 x 16)

@shoshijak is this CUDA arch or kernel dependent?

Recipe for using DBCSR/develop in CP2K

It would be handy to have a recipe for using DBCSR/develop in CP2K. The recipe is not meant to touch open questions (Git-submodules, etc.), but rather meant to have at least a list of (manual) steps to follow. It may also serve as a work-item to initially harmonize both sources (CP2K build system to account for renamed files, etc.).

Automate sync to CP2K SVN

If we no longer want to replicate the complete DBCSR history in SVN, the setup on a worker that fetches Git commits in batches and pushes them to CP2K SVN could look as follows.

Initial setup:

svn co https://svn.code.sf.net/p/cp2k/code/trunk/cp2k/exts/dbcsr /tmp/dbcsr-sync
cd /tmp/dbcsr-sync
git init
git remote add origin https://github.com/cp2k/dbcsr.git
git fetch origin
git checkout -b master --track origin/master

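# make sure .git is never added; svn reads the svn:ignore property, not a .svnignore file
# (note: propset replaces any existing svn:ignore value on this directory)
svn propset svn:ignore '.git' .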

The task runner would then have to do something along these lines:

cd /tmp/dbcsr-sync

# remember the commit we start from so the log range below is well defined
prev_hash=$(git rev-parse HEAD)
git pull --rebase

# adding new files is easy, simply force-add everything
svn add --force .
# removing files is a bit trickier
svn status | sed -e '/^!/!d' -e 's/^!//' | xargs svn rm

curr_hash=$(git rev-parse HEAD)

commit_msg=$(mktemp)
echo "Updating DBCSR from git @${curr_hash}" > "${commit_msg}"
git log "${prev_hash}..${curr_hash}" >> "${commit_msg}"
svn commit --file "${commit_msg}"
rm "${commit_msg}"

This would result in squashed commits.

Other options/variations:

  • replicate individual commits via git-svn or custom scripts
  • do not pull from master (which is supposed to be stable) but yet another sync-to-CP2K branch

File extensions in the Makefile

I see that the makefile is checking for these file extensions:

OBJ_SRC_FILES = $(shell cd $(SRCDIR); find . -name "*.F")
OBJ_SRC_FILES += $(shell cd $(SRCDIR); find . -name "*.c")
OBJ_SRC_FILES += $(shell cd $(SRCDIR); find . -name "*.cpp")
OBJ_SRC_FILES += $(shell cd $(SRCDIR); find . -name "*.cxx")
OBJ_SRC_FILES += $(shell cd $(SRCDIR); find . -name "*.cc")

However, we have only targets for .F, .c, .cpp (and .cu).
Should we remove those extensions?
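Before removing the extra patterns, it may help to check whether any such files exist in the tree at all. A minimal sketch, where `count_by_ext` is a hypothetical helper and `src` is only an assumed default for `SRCDIR`:

```shell
#!/bin/sh
# count_by_ext DIR EXT...
# Print "<ext> <count>" for each extension, counting regular files
# below DIR whose name ends in ".<ext>".
count_by_ext() {
    dir=$1
    shift
    for ext in "$@"; do
        n=$(find "$dir" -type f -name "*.$ext" 2>/dev/null | wc -l | tr -d ' ')
        printf '%s %s\n' "$ext" "$n"
    done
}

# Extensions taken from the makefile patterns and targets mentioned above.
count_by_ext "${SRCDIR:-src}" F c cpp cxx cc cu
```

Extensions that report a count of 0 and have no build rule are candidates for removal.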

Tests hang

It stays this way forever:

Running tests...
/usr/local/bin/ctest --force-new-ctest-process 
Test project /usr/ports/math/dbcsr/work/.build
      Start  1: dbcsr_perf:inputs/test_rect1_dense.perf

FreeBSD 11.2 amd64

Add a check in the makefile to verify that fypp is installed

This check is needed when people download DBCSR as a zip file, which does not include fypp (since it is a submodule). In this case, the makefile should print a message like:
Fypp is not installed in the default position ..., please install it and use make FYPPEXE=<> ... to point to it.
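The test itself could be a few lines of shell invoked from the makefile. This is a sketch only: `check_fypp` is a made-up name, and falling back to a plain `fypp` on the PATH when `FYPPEXE` is unset is an assumption, not the project's actual default location.

```shell
#!/bin/sh
# check_fypp EXE
# Succeed if EXE is an executable file or a command found on PATH;
# otherwise print a hint to stderr and fail.
check_fypp() {
    exe=$1
    if { [ -f "$exe" ] && [ -x "$exe" ]; } || command -v "$exe" >/dev/null 2>&1; then
        return 0
    fi
    echo "Fypp not found at '$exe'; please install it and use 'make FYPPEXE=<path>' to point to it." >&2
    return 1
}

# Assumed default: look for "fypp" on PATH when FYPPEXE is not set.
check_fypp "${FYPPEXE:-fypp}" || true
```

In a makefile, the non-zero return value would be used to abort the build early instead of failing later with a less obvious error.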

Remove path dependency in the GPU kernel generation

@shoshijak
Currently, we are hard-coding the path to the kernels (in the file libcusmm.cpp) by setting the external macro DBCSRHOME.

#define KERNEL_FILES_PATH DBCSRHOME "/src/acc/libsmm_acc/libcusmm/kernels/"

The reason for that is because we set the include directory:

const std::string kernel_files_path = KERNEL_FILES_PATH;
const std::string include_opt = "-I=" + kernel_files_path;

The only header I see is cusmm_common.h, which is a bunch of macro definitions.
Would it be possible to hack generate_kernels.py and embed the entire cusmm_common.h in a variable, so that we no longer need any include (or path)? We could then obtain the kernel code by concatenating the header string and the kernel string.

CMake: Parallel build issue for tests

I know that the CMake build is experimental; nonetheless, I think the errors are not CMake-specific:

f951: Fatal Error: Can't rename module file ‘dbcsr_test_add.mod0’ to ‘dbcsr_test_add.mod’: No such file or directory
f951: Fatal Error: Can't rename module file ‘dbcsr_test_multiply.mod0’ to ‘dbcsr_test_multiply.mod’: No such file or directory
Error copying Fortran module "tests/dbcsr_test_add.mod".  Tried "tests/DBCSR_TEST_ADD.mod" and "tests/dbcsr_test_add.mod".

I am building on macOS with gcc and the default make. I will try GNU make to see if there is any difference...

dbcsr_api_c.F: broken with gcc-4.8.5 due to OPTIONAL in BIND(C) procedure

mpifort -c -O3 -g -fno-omit-frame-pointer -funroll-loops -ffast-math -fopenmp -D__MPI_VERSION=3 -D__parallel -D__SCALAPACK -D__STATM_TOTAL -ffree-form -std=f2003 -fimplicit-none -ffree-line-length-512 -Werror=aliasing -Werror=ampersand -Werror=c-binding-type -Werror=intrinsic-shadow -Werror=intrinsics-std -Werror=line-truncation -Werror=tabs -Werror=realloc-lhs-all -Werror=target-lifetime -Werror=underflow -Werror=unused-but-set-variable -Werror=unused-variable -Werror=unused-dummy-argument -Werror=conversion  -D__SHORT_FILE__="\"dbcsr_tensor_types.F\"" -I'/users/tiziano/work/dbcsr/dbcsr-fork/src/tensors/' -I'/users/tiziano/work/dbcsr/dbcsr-fork/src' dbcsr_tensor_types.F90 
/users/tiziano/work/dbcsr/dbcsr-fork/src/dbcsr_api_c.F:229.116:

a, transb, alpha, c_matrix_a, c_matrix_b, beta, c_matrix_c, retain_sparsity) &
                                                                           1
Error: TS 29113: Variable 'retain_sparsity' at (1) with OPTIONAL attribute in procedure 'c_dbcsr_multiply_d' which is BIND(C)

I guess this is intentional and requires 2008+TS support in the compiler, but there is no hint in the README about that fact.

Update file headers

Following the discussion on the CP2K-Dev ML: I think we should update the file headers at some point:

  1. most of them only mention CP2K, not DBCSR specifically
  2. many mention a copyright, but no license

Example from current:

dbcsr/src/dbcsr_api.F

Lines 1 to 4 in a6eea29

!--------------------------------------------------------------------------------------------------!
! CP2K: A general program to perform molecular dynamics simulations !
! Copyright (C) 2000 - 2018 CP2K developers group !
!--------------------------------------------------------------------------------------------------!

For the first part we need good wording that declares DBCSR as its own project but retains the link to CP2K. For the second part I guess it would make sense to use an SPDX license identifier.

Bug in Cannon

A bug in Cannon was introduced with the commit for improving multiplications with virtual images. The bug appears on Piz Daint (Ilia found it) when a square grid is used.

Multiple definition of constexpr symbols when used in different compilation units

cd dbcsr/
echo '#include <dbcsr.h>' > test_a.cc
echo '#include <dbcsr.h>' > test_b.cc
mpic++ -Isrc -c test_{a,b}.cc
mpic++ -shared -o test_api.so test_{a,b}.o

gives

/usr/lib64/gcc/x86_64-suse-linux/7/../../../../x86_64-suse-linux/bin/ld: test_b.o: in function `c_dbcsr_finalize_lib':
test_b.cc:(.text+0x0): multiple definition of `c_dbcsr_finalize_lib'; test_a.o:test_a.cc:(.text+0x0): first defined here
/usr/lib64/gcc/x86_64-suse-linux/7/../../../../x86_64-suse-linux/bin/ld: test_b.o: in function `c_dbcsr_distribution_new':
test_b.cc:(.text+0x2a): multiple definition of `c_dbcsr_distribution_new'; test_a.o:test_a.cc:(.text+0x2a): first defined here
/usr/lib64/gcc/x86_64-suse-linux/7/../../../../x86_64-suse-linux/bin/ld: test_b.o:(.rodata+0x240): multiple definition of `dbcsr::init_lib'; test_a.o:(.rodata+0x240): first defined here
/usr/lib64/gcc/x86_64-suse-linux/7/../../../../x86_64-suse-linux/bin/ld: test_b.o:(.rodata+0x248): multiple definition of `dbcsr::finalize_lib'; test_a.o:(.rodata+0x248): first defined here
/usr/lib64/gcc/x86_64-suse-linux/7/../../../../x86_64-suse-linux/bin/ld: test_b.o:(.rodata+0x250): multiple definition of `dbcsr::distribution_new'; test_a.o:(.rodata+0x250): first defined here
/usr/lib64/gcc/x86_64-suse-linux/7/../../../../x86_64-suse-linux/bin/ld: test_b.o:(.rodata+0x258): multiple definition of `dbcsr::distribution_release'; test_a.o:(.rodata+0x258): first defined here
/usr/lib64/gcc/x86_64-suse-linux/7/../../../../x86_64-suse-linux/bin/ld: test_b.o:(.rodata+0x260): multiple definition of `dbcsr::create_new_d'; test_a.o:(.rodata+0x260): first defined here
/usr/lib64/gcc/x86_64-suse-linux/7/../../../../x86_64-suse-linux/bin/ld: test_b.o:(.rodata+0x268): multiple definition of `dbcsr::finalize'; test_a.o:(.rodata+0x268): first defined here
/usr/lib64/gcc/x86_64-suse-linux/7/../../../../x86_64-suse-linux/bin/ld: test_b.o:(.rodata+0x270): multiple definition of `dbcsr::release'; test_a.o:(.rodata+0x270): first defined here
/usr/lib64/gcc/x86_64-suse-linux/7/../../../../x86_64-suse-linux/bin/ld: test_b.o:(.rodata+0x278): multiple definition of `dbcsr::print'; test_a.o:(.rodata+0x278): first defined here
/usr/lib64/gcc/x86_64-suse-linux/7/../../../../x86_64-suse-linux/bin/ld: test_b.o:(.rodata+0x280): multiple definition of `dbcsr::get_stored_coordinates'; test_a.o:(.rodata+0x280): first defined here
/usr/lib64/gcc/x86_64-suse-linux/7/../../../../x86_64-suse-linux/bin/ld: test_b.o:(.rodata+0x288): multiple definition of `dbcsr::put_block_d'; test_a.o:(.rodata+0x288): first defined here
/usr/lib64/gcc/x86_64-suse-linux/7/../../../../x86_64-suse-linux/bin/ld: test_b.o:(.rodata+0x290): multiple definition of `dbcsr::multiply_d'; test_a.o:(.rodata+0x290): first defined here
/usr/lib64/gcc/x86_64-suse-linux/7/../../../../x86_64-suse-linux/bin/ld: test_a.o: relocation R_X86_64_32 against `.bss' can not be used when making a shared object; recompile with -fPIC
/usr/lib64/gcc/x86_64-suse-linux/7/../../../../x86_64-suse-linux/bin/ld: test_b.o: relocation R_X86_64_32 against `.bss' can not be used when making a shared object; recompile with -fPIC
/usr/lib64/gcc/x86_64-suse-linux/7/../../../../x86_64-suse-linux/bin/ld: final link failed: nonrepresentable section on output
collect2: error: ld returned 1 exit status

reported by @jkn93 in PR #64

Seems we have to come up with a different idea on how to alias those functions in the header. Please note that according to the C++ standard (7.1.5) a constexpr function is being turned into an implicit inline function.

DBCSR does not compile with enabled CUBLAS.

The error looks like acc_devmem_cptr does not work (it converts TYPE(acc_devmem_type) to c-ptr)
...dbcsr/src/mm/dbcsr_acc_operations.F:225:54:

   istat = libsmm_acc_transpose_cu(acc_devmem_cptr(trs_stack), &
                                                  1

Error: Type mismatch in argument 'this' at (1); passed TYPE(acc_devmem_type) to INTEGER(4)

dbcsr_test_csr_conversions.x hangs when built using Intel toolchain

Summary
I have attempted building dbcsr with different toolchains. Using GCC/OpenMPI/refblas/reflapack, I have no issues running all the tests with make test. However, when I use the Intel compiler, Intel MPI and MKL, dbcsr_test_csr_conversions.x hangs forever inside dbcsr_zero. The other tests seem to work fine.

Details
I'm using the Intel compiler and libraries as included in Parallel Studio XE 2018. I checked out the current head of the develop branch (7070fc8) and patched Makefile.inc as follows to use the Intel toolchain:

@@ -1,6 +1,6 @@
 
-CC          = gcc
-CPP         =
-FC          = mpifort
-LD          = mpifort
+CC          = mpiicc
+CPP         = mpiicc
+FC          = mpiifort
+LD          = mpiifort
 AR          = ar -r
@@ -32,3 +32,3 @@ endif
 LDFLAGS     = $(FCFLAGS)
-LIBS        = -L${SCALAPACK_PATH}/lib -lscalapack -lreflapack -lrefblas
+LIBS        = -mkl

Building throws lots of the following warnings but completes successfully.

ifort: command line warning #10006: ignoring unknown option '-ffree-form'
ifort: command line warning #10006: ignoring unknown option '-std=f2003'
ifort: command line warning #10006: ignoring unknown option '-fimplicit-none'
ifort: command line warning #10006: ignoring unknown option '-ffree-line-length-512'
ifort: command line warning #10157: ignoring option '-W'; argument is of wrong type

make test however stalls when running dbcsr_test_csr_conversions.x and consumes 400% CPU (determined by $OMP_NUM_THREADS). gdb shows the following:

dbcsr_zero (matrix_a=<error reading variable: Cannot access memory at address 0x1>) at /scratch/pc2-mitarbeiter/lass/priv/git/dbcsr.intel/src/ops/dbcsr_operations.F:264
264	         matrix_a%data_area%d%r_dp = 0.0_dp
Missing separate debuginfos, use: debuginfo-install glibc-2.17-196.el7.x86_64 libgcc-4.8.5-4.el7.x86_64 numactl-devel-2.0.9-5.el7_1.x86_64
(gdb) info threads
  Id   Target Id         Frame 
  4    Thread 0x2aaab284b780 (LWP 7242) "dbcsr_test_csr_" 0x0000000000596cd7 in dbcsr_zero (matrix_a=<error reading variable: Cannot access memory at address 0x1f>) at /scratch/pc2-mitarbeiter/lass/priv/git/dbcsr.intel/src/ops/dbcsr_operations.F:264
  3    Thread 0x2aaab2c4c800 (LWP 7243) "dbcsr_test_csr_" 0x0000000000596cb7 in dbcsr_zero (matrix_a=<error reading variable: Cannot access memory at address 0x3d>) at /scratch/pc2-mitarbeiter/lass/priv/git/dbcsr.intel/src/ops/dbcsr_operations.F:264
  2    Thread 0x2aaabb04d880 (LWP 7244) "dbcsr_test_csr_" 0x0000000000596cbe in dbcsr_zero (matrix_a=<error reading variable: Cannot access memory at address 0x5b>) at /scratch/pc2-mitarbeiter/lass/priv/git/dbcsr.intel/src/ops/dbcsr_operations.F:264
* 1    Thread 0x2aaaaaae9200 (LWP 7237) "dbcsr_test_csr_" dbcsr_zero (matrix_a=<error reading variable: Cannot access memory at address 0x1>) at /scratch/pc2-mitarbeiter/lass/priv/git/dbcsr.intel/src/ops/dbcsr_operations.F:264
(gdb) bt
#0  dbcsr_zero (matrix_a=<error reading variable: Cannot access memory at address 0x1>) at /scratch/pc2-mitarbeiter/lass/priv/git/dbcsr.intel/src/ops/dbcsr_operations.F:264
#1  dbcsr_operations::dbcsr_set_d (matrix=<error reading variable: Cannot access memory at address 0x1e>, alpha=<error reading variable: Cannot access memory at address 0x1>) at /scratch/pc2-mitarbeiter/lass/priv/git/dbcsr.intel/src/ops/dbcsr_operations.f90:258
#2  0x00002aaaaf2b1ac3 in __kmp_invoke_microtask () from /upb/departments/pc2/software/INTEL/ps_xe_2018/compilers_and_libraries_2018.0.128/linux/compiler/lib/intel64/libiomp5.so
#3  0x00002aaaaf280257 in __kmp_invoke_task_func (gtid=30) at ../../src/kmp_runtime.cpp:7138
#4  0x00002aaaaf281498 in __kmp_fork_call (loc=0x1e, gtid=1, call_context=fork_context_gnu, argc=-1318061824, microtask=0x1, invoker=0x0, ap=0x7fffffff7f80) at ../../src/kmp_runtime.cpp:2406
#5  0x00002aaaaf2587de in __kmpc_fork_call (loc=0x1e, argc=1, microtask=0x0) at ../../src/kmp_csupport.cpp:341
#6  0x0000000000596e07 in dbcsr_zero (matrix_a=...) at /scratch/pc2-mitarbeiter/lass/priv/git/dbcsr.intel/src/ops/dbcsr_operations.F:263
#7  dbcsr_operations::dbcsr_set_d (matrix=<error reading variable: Cannot access memory at address 0x1e>, alpha=<error reading variable: Cannot access memory at address 0x1>) at /scratch/pc2-mitarbeiter/lass/priv/git/dbcsr.intel/src/ops/dbcsr_operations.f90:258
#8  0x000000000055d695 in dbcsr_transformations::dbcsr_complete_redistribute (matrix=<error reading variable: Cannot access memory at address 0x1e>, redist=<error reading variable: Cannot access memory at address 0x1>, keep_sparsity=<error reading variable: Cannot access memory at address 0x0>)
    at /scratch/pc2-mitarbeiter/lass/priv/git/dbcsr.intel/src/ops/dbcsr_transformations.F:1640
#9  0x0000000000572326 in dbcsr_csr_conversions::convert_dbcsr_to_csr (dbcsr_mat=<error reading variable: Cannot access memory at address 0x1e>, csr_mat=<error reading variable: Cannot access memory at address 0x1>) at /scratch/pc2-mitarbeiter/lass/priv/git/dbcsr.intel/src/ops/dbcsr_csr_conversions.F:1122
#10 0x0000000000407d6f in csr_conversion_test (dbcsr_mat=..., csr_mat=..., norm=<optimized out>, eps=<optimized out>) at /scratch/pc2-mitarbeiter/lass/priv/git/dbcsr.intel/tests/dbcsr_test_csr_conversions.F:161
#11 dbcsr_test_csr_conversions () at /scratch/pc2-mitarbeiter/lass/priv/git/dbcsr.intel/tests/dbcsr_test_csr_conversions.F:113
#12 0x0000000000406c1e in main ()
#13 0x00002aaab0fc6c05 in __libc_start_main () from /lib64/libc.so.6
#14 0x0000000000406b29 in _start ()

Suspecting a threading issue I already attempted linking against sequential mkl using -mkl=sequential and setting $OMP_NUM_THREADS to 1. However, this did not help.

Do you have any idea what could be the issue here or how to proceed in debugging this?

Fortran runtime error

I get the following runtime error when trying to run the C++ example:

At line 127 of file <dbcsr>/src/base/dbcsr_machine_posix.f90 (unit = 121245)
Fortran runtime error: Cannot open file '/proc/self/statm': No such file or directory
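The file in the message is a Linux procfs entry, so it is generally absent on systems without a Linux-style /proc, such as macOS. A quick way to confirm whether the platform provides it (the helper name `has_statm` is made up for this sketch):

```shell
#!/bin/sh
# has_statm [FILE]
# Succeed if the memory-statistics file is readable.
# FILE defaults to the path from the error message.
has_statm() {
    [ -r "${1:-/proc/self/statm}" ]
}

if has_statm; then
    echo "/proc/self/statm is available"
else
    echo "/proc/self/statm is missing on this platform"
fi
```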

dbcsr_make_undense behaves strangely

CALL dbcsr_make_undense(product_matrix, matrix_c_local, &

In dbcsr_make_undense, the number of blocks (%nblks) is copied from the distribution as nblkrows_local*nblkcols_local. This gives the wrong number of blocks for a matrix that has 0 blocks on the current rank.
