
rocm / rocsparse


Next generation SPARSE implementation for ROCm platform

Home Page: https://rocm.docs.amd.com/projects/rocSPARSE/en/latest/

License: MIT License

CMake 0.63% C++ 86.29% C 8.46% Shell 0.52% Python 0.75% Groovy 0.17% Fortran 3.18%

rocsparse's Introduction

rocSPARSE

rocSPARSE exposes a common interface that provides Basic Linear Algebra Subroutines (BLAS) for sparse computation. It is implemented on top of the AMD ROCm runtime and toolchains. rocSPARSE is written in the HIP programming language and optimized for AMD's latest discrete GPUs.
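As a quick orientation, here is a minimal sketch of calling the library: create a handle, query the library version, and destroy the handle. It assumes the header is installed as <rocsparse/rocsparse.h> (older ROCm releases install it as <rocsparse.h>) and that the program is compiled with hipcc and linked against -lrocsparse.

#include <rocsparse/rocsparse.h>
#include <cstdio>

int main()
{
    rocsparse_handle handle;
    if(rocsparse_create_handle(&handle) != rocsparse_status_success)
    {
        std::fprintf(stderr, "failed to create rocSPARSE handle\n");
        return 1;
    }

    // Query the library version (encoded as a single integer)
    rocsparse_int version;
    rocsparse_get_version(handle, &version);
    std::printf("rocSPARSE version: %d\n", (int)version);

    rocsparse_destroy_handle(handle);
    return 0;
}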

Documentation

Documentation for rocSPARSE is available at https://rocm.docs.amd.com/projects/rocSPARSE/en/latest/.

To build our documentation locally, run the following commands:

cd docs
pip3 install -r sphinx/requirements.txt
python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html

Alternatively, build with CMake:

cmake -DBUILD_DOCS=ON ...

Requirements

  • Git
  • CMake (3.5 or later)
  • AMD ROCm platform (version 3.5 or later)

Optional:

  • GoogleTest (required only for tests)
    • Use GTEST_ROOT to specify a location
    • If you don't have GoogleTest installed, CMake automatically downloads and builds it

Build and install

  1. Build rocSPARSE using the install.sh script.

    # Clone rocSPARSE using git
    git clone https://github.com/ROCm/rocSPARSE.git
    
    # Go to rocSPARSE directory
    cd rocSPARSE
    
    # Run install.sh script
    # Command line options:
    #   -h|--help         - prints help message
    #   -i|--install      - install after build
    #   -d|--dependencies - install build dependencies
    #   -c|--clients      - build library clients too (combines with -i & -d)
    #   -g|--debug        - build with debug flag
    ./install.sh -dci
  2. Alternatively, build rocSPARSE manually (all compiler specifications are determined automatically).

    # Clone rocSPARSE using git
    git clone https://github.com/ROCm/rocSPARSE.git
    
    # Go to rocSPARSE directory, create and go to the build directory
    cd rocSPARSE; mkdir -p build/release; cd build/release
    
    # Configure rocSPARSE
    # Build options:
    #   BUILD_CLIENTS_TESTS      - build tests (OFF)
    #   BUILD_CLIENTS_BENCHMARKS - build benchmarks (OFF)
    #   BUILD_CLIENTS_SAMPLES    - build examples (ON)
    #   BUILD_VERBOSE            - verbose output (OFF)
    #   BUILD_SHARED_LIBS        - build rocSPARSE as a shared library (ON)
    CXX=/opt/rocm/bin/hipcc cmake -DBUILD_CLIENTS_TESTS=ON ../..
    
    # Build
    make
    
    # Install
    [sudo] make install

Unit tests and benchmarks

To run unit tests, you must build rocSPARSE with -DBUILD_CLIENTS_TESTS=ON.

# Go to rocSPARSE build directory
cd rocSPARSE/build/release

# Run all tests
./clients/staging/rocsparse-test

To run benchmarks, you must build rocSPARSE with -DBUILD_CLIENTS_BENCHMARKS=ON.

# Go to rocSPARSE build directory
cd rocSPARSE/build/release

# Run benchmark, e.g.
./clients/staging/rocsparse-bench -f hybmv --laplacian-dim 2000 -i 200

Issues

To submit an issue, a bug, or a feature request, use the GitHub issue tracker.

License

Our license file is located in the main repository.

rocsparse's People

Contributors

agonzales-amd, amd-jmacaran, amdkila, angeloo01, arvindcheru, cgmb, dependabot[bot], eidenyoshida, ex-rzr, jsandham, lawruble13, matyas-streamhpc, ntrost57, pavahora, peterjunpark, pramenku, pruthvistony, randyh62, raramakr, saadrahim, samjwu, scchan, swraw, tfalders, urmbista, victorapm, xuhuisheng, yoichiyoshida, yvanmokwinski, yxsamliu


rocsparse's Issues

./clients/staging/rocsparse-bench terminate called after throwing an instance of 'boost::wrapexcept<std::logic_error>' what(): character conversion failed Aborted (core dumped)

What is the expected behavior

  • to run

What actually happens

  • ./clients/staging/rocsparse-bench
    terminate called after throwing an instance of 'boost::wrapexcept<std::logic_error>'
    what(): character conversion failed
    Aborted (core dumped)

How to reproduce

  • ./clients/staging/rocsparse-bench
    terminate called after throwing an instance of 'boost::wrapexcept<std::logic_error>'
    what(): character conversion failed
    Aborted (core dumped)

Environment

Hardware:
  GPU: device string
  CPU: device string
Software versions:
  ROCK: v0.0
  ROCR: v0.0
  HCC: v0.0
  Library: v0.0

root@fastmmw:/home/paolo/FastMM/Epyc/rocALUTION/build# /opt/rocm/bin/rocminfo 
ROCk module is loaded
Able to open /dev/kfd read-write
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen Threadripper 1950X 16-Core Processor
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen Threadripper 1950X 16-Core Processor
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3400                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            16                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    65775068(0x3eba5dc) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    65775068(0x3eba5dc) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
    N/A                      
*******                  
Agent 2                  
*******                  
  Name:                    gfx906                             
  Uuid:                    GPU-69d2708172fc1a8c               
  Marketing Name:          Vega 20 [Radeon VII]               
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          4096(0x1000)                       
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
  Chip ID:                 26287(0x66af)                      
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1801                               
  BDFID:                   2816                               
  Internal Node ID:        1                                  
  Compute Unit:            60                                 
  SIMDs per CU:            4                                  
  Shader Engines:          4                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      FALSE                              
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    16760832(0xffc000) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx906          
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*******                  
Agent 3                  
*******                  
  Name:                    gfx803                             
  Uuid:                    GPU-XX                             
  Marketing Name:          Ellesmere [Radeon Pro WX 7100]     
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          4096(0x1000)                       
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    2                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
  Chip ID:                 26564(0x67c4)                      
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1243                               
  BDFID:                   17408                              
  Internal Node ID:        2                                  
  Compute Unit:            36                                 
  SIMDs per CU:            4                                  
  Shader Engines:          4                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      FALSE                              
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    16777216(0x1000000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx803          
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*******                  
Agent 4                  
*******                  
  Name:                    gfx803                             
  Uuid:                    GPU-XX                             
  Marketing Name:          Ellesmere [Radeon Pro WX 7100]     
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          4096(0x1000)                       
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    3                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
  Chip ID:                 26564(0x67c4)                      
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1243                               
  BDFID:                   17664                              
  Internal Node ID:        3                                  
  Compute Unit:            36                                 
  SIMDs per CU:            4                                  
  Shader Engines:          4                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      FALSE                              
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    16777216(0x1000000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx803          
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***   
 ldd ./clients/staging/rocsparse-bench 
	linux-vdso.so.1 (0x00007ffe613cb000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f24a8afa000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f24a8ad7000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f24a8988000)
	libboost_program_options.so.1.71.0 => /usr/lib/x86_64-linux-gnu/libboost_program_options.so.1.71.0 (0x00007f24a88f9000)
	librocsparse.so.0 => /home/paolo/FastMM/Epyc/rocSPARSE/build/release/library/librocsparse.so.0 (0x00007f24a0c18000)
	libamdhip64.so.4 => /opt/rocm/lib/libamdhip64.so.4 (0x00007f249fc90000)
	libomp.so => /opt/rocm-4.0.0/llvm/lib/libomp.so (0x00007f249f9a6000)
	libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f249f7c5000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f249f5d3000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f24a8b39000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f249f5cb000)
	libhsa-runtime64.so.1 => /opt/rocm/lib/libhsa-runtime64.so.1 (0x00007f249f1b6000)
	libamd_comgr.so.1 => /opt/rocm/lib/libamd_comgr.so.1 (0x00007f249867f000)
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f2498674000)
	libhsakmt.so.1 => /opt/rocm/lib/libhsakmt.so.1 (0x00007f249844b000)
	libelf.so.1 => /usr/lib/x86_64-linux-gnu/libelf.so.1 (0x00007f249842d000)
	libtinfo.so.5 => /lib/x86_64-linux-gnu/libtinfo.so.5 (0x00007f24983ff000)
	libnuma.so.1 => /usr/lib/x86_64-linux-gnu/libnuma.so.1 (0x00007f24983f2000)
	libz.so.1 => /usr/local/lib/libz.so.1 (0x00007f24981d4000)

 git branch
* develop
  rocm-4.0.x

rocsparse-test spgemm_bsr.extra: Assertion `format == rocsparse_format_csc' failed.

What is the expected behavior

  • Test pass

What actually happens

/clients/tests/../include/utility.hpp:484: rocsparse_local_spmat::rocsparse_local_spmat(int64_t, int64_t, int64_t, void *, void *, void *, rocsparse_indextype, rocsparse_indextype, rocsparse_index_base, rocsparse_datatype, rocsparse_format): Assertion `format == rocsparse_format_csc' failed.

By backtracing, it seems that https://github.com/ROCmSoftwarePlatform/rocSPARSE/blob/rocm-5.7.1/clients/testings/testing_spgemm_bsr.cpp#L76 is calling https://github.com/ROCmSoftwarePlatform/rocSPARSE/blob/rocm-5.7.1/clients/include/utility.hpp#L455, but not https://github.com/ROCmSoftwarePlatform/rocSPARSE/blob/rocm-5.7.1/clients/include/utility.hpp#L520, which I cannot understand

The full build and test log (Gentoo rocSPARSE package) is attached:

build.log.gz

How to reproduce

  • Run ./rocsparse-test --gtest_filter=*pre_checkin/spgemm_bsr.extra/i32_i32_f32_r_bad_arg*

Environment

Hardware:
  GPU: AMD Instinct MI210
  CPU: AMD EPYC 7763 64-Core Processor
Software versions:
  ROCK: linux-6.1 (Debian 6.1.27-1)
  ROCR: rocm-5.7.1
  HIP: rocm-5.7.1
  Library: rocm-5.7.1

Not able to install libcxxtools9

I am getting this error when installing with ./install.sh -icd:

Continue? [y/n/v/...? shows all options] (y): y
Retrieving package libcxxtools9-2.2.1-11.1.x86_64 (1/1), 325.0 KiB ( 1.2 MiB unpacked)
Retrieving: libcxxtools9-2.2.1-11.1.x86_64.rpm ......................................................................................................................................................................................................................[not found]
File './x86_64/libcxxtools9-2.2.1-11.1.x86_64.rpm' not found on medium 'http://download.opensuse.org/repositories/server:/http/SLE_15/'
Abort, retry, ignore? [a/r/i/...? shows all options] (a): a
Problem occurred during or after installation or removal of packages:
Installation has been aborted as directed.

csrmv behaves inconsistently with zero matrices if passing nullptr data

What is the expected behavior

  • hipsparseDcsrmv on an empty matrix (all-zero row pointers) should not care what is passed as the column index and value arrays.

What actually happens

  • hipsparseDcsrmv computes the correct result (0) if we pass any non-null pointer, but leaves the output unmodified if we pass null pointers for column indices and values.

How to reproduce

  • Compile the following code and run it with either of the two hipsparseDcsrmv invocations. They should produce equivalent results, but one writes 0 to memory, the other one leaves the 0xCDCDCDCD values unchanged.
#include <hipsparse.h>
#include <iomanip>
#include <iostream>

int main() {
  double *mem;
  int *zeros;
  double alpha{1.0};
  double beta{0.0};
  hipsparseHandle_t handle;
  hipMalloc(&mem, sizeof(double) * 3);
  hipMalloc(&zeros, sizeof(int) * 2);
  hipMemset(zeros, 0, sizeof(int) * 2);
  hipMemset(mem, 0xCD, sizeof(double) * 3);
  hipsparseCreate(&handle);
  hipsparseSetPointerMode(handle, HIPSPARSE_POINTER_MODE_HOST);
  hipsparseMatDescr_t descr;
  hipsparseCreateMatDescr(&descr);
#if 1
  // Empty-matrix case: pass nullptr for the value and column index arrays
  hipsparseDcsrmv(handle, HIPSPARSE_OPERATION_NON_TRANSPOSE, 1, 1, 1, &alpha,
                  descr, nullptr, zeros, nullptr, mem, &beta, mem + 2);
#else
  // Same call, but with arbitrary non-null dummy pointers for values/columns
  hipsparseDcsrmv(handle, HIPSPARSE_OPERATION_NON_TRANSPOSE, 1, 1, 1, &alpha,
                  descr, (double *)1, zeros, (int *)1, mem, &beta, mem + 2);
#endif
  long result{};
  hipMemcpy(&result, mem + 2, sizeof(long), hipMemcpyDeviceToHost);
  std::cout << std::hex << result << std::endl;
  hipsparseDestroyMatDescr(descr);
  hipsparseDestroy(handle);
  hipFree(zeros);
  hipFree(mem);
}

Environment

Hardware:
  GPU: Radeon VII
  CPU: AMD Ryzen Threadripper 1920X 12-Core Processor
Software versions:
  rocSPARSE: v1.20.2

rocsparse_spsm ignores dense matrix order

What is the expected behavior

In the rocsparse_spsm function (link) (and maybe other functions as well), the memory order of the dense matrices (B and C) should be taken into account.

The minimum I expect: check whether the orders of B and C are equal (if not, return an error). If the order is row-major, change the transB value to the opposite one and swap numrows and numcols.

What actually happens

The memory order of the dense matrices is completely ignored, and is assumed to be column-major, which is not explicitly mentioned in the documentation.

How to reproduce

I don't have a simple example, but I think it is not necessary, since I can clearly see that the order is ignored in the code. Will provide if needed.
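Not a rocSPARSE call, but a minimal self-contained sketch (with made-up sizes) of the equivalence the suggestion above relies on: a buffer holding a row-major m x n matrix is, byte for byte, a column-major n x m matrix (its transpose), so a column-major-only routine can honor row-major inputs by flipping the transpose flag and swapping the row/column counts.

#include <cstdio>
#include <vector>

// Reference column-major GEMV: y = op(A) * x, where A is rows x cols
// with leading dimension ld and op is either identity or transpose
static void gemv_colmajor(bool transpose, int rows, int cols, int ld,
                          const double* A, const double* x, double* y)
{
    const int out = transpose ? cols : rows;
    const int in  = transpose ? rows : cols;
    for(int i = 0; i < out; ++i)
    {
        double sum = 0.0;
        for(int k = 0; k < in; ++k)
            sum += (transpose ? A[i * ld + k] : A[k * ld + i]) * x[k];
        y[i] = sum;
    }
}

int main()
{
    // A is a 2 x 3 matrix stored row-major with leading dimension 3
    const int m = 2, n = 3, ld = 3;
    std::vector<double> A = {1, 2, 3,
                             4, 5, 6};
    std::vector<double> x = {1, 10, 100};

    // Direct row-major evaluation of y = A * x
    std::vector<double> y_direct(m, 0.0);
    for(int i = 0; i < m; ++i)
        for(int j = 0; j < n; ++j)
            y_direct[i] += A[i * ld + j] * x[j];

    // The same buffer viewed column-major is the n x m matrix A^T, so a
    // column-major routine reproduces y = A * x after flipping the
    // transpose flag and swapping the row/column counts
    std::vector<double> y_swapped(m, 0.0);
    gemv_colmajor(true /* transpose */, n, m, ld, A.data(), x.data(), y_swapped.data());

    for(int i = 0; i < m; ++i)
        std::printf("y_direct[%d] = %g, y_swapped[%d] = %g\n", i, y_direct[i], i, y_swapped[i]);
    return 0;
}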

Environment

I am using the rocSPARSE that ships with rocm-5.2.3, but I can see the same problem in the sources for rocm-6.0.0.

Question about the cmake workflow when including it in an external project.

What is the expected behavior

I wrote a CMake FetchContent script to fetch a version of rocSPARSE from GitHub instead of relying on the version pre-installed on the system (through rocm-lib). The script looks like the following (FetchROCSparse.cmake):

include(FetchContent)
set(FETCHCONTENT_QUIET ON)

message(STATUS "Cloning External Project: Rocsparse")
get_filename_component(FC_BASE "../externals"
                REALPATH BASE_DIR "${CMAKE_BINARY_DIR}")
set(FETCHCONTENT_BASE_DIR ${FC_BASE})

set(CMAKE_CXX_COMPILER_ID "Clang")
set(BUILD_CLIENTS_SAMPLES OFF)

FetchContent_Declare(
    rocsparse
    GIT_REPOSITORY https://github.com/ROCmSoftwarePlatform/rocSPARSE.git
    GIT_TAG        rocm-5.4.2
)

FetchContent_GetProperties(rocsparse)
if(NOT rocsparse_POPULATED)
  FetchContent_Populate(
    rocsparse
  )
endif()
# Exposing rocsparse's source and include directory
set(ROCSPARSE_SOURCE_DIR "${rocsparse_SOURCE_DIR}")
set(ROCSPARSE_BUILD_DIR "${rocsparse_BINARY_DIR}")

# Add subdirectory ::rocsparse
add_subdirectory(${ROCSPARSE_SOURCE_DIR})

The output of this is:

-- Cloning External Project: Rocsparse
-- The Fortran compiler identification is Flang 99.99.1
-- Detecting Fortran compiler ABI info
-- Detecting Fortran compiler ABI info - done
-- Check for working Fortran compiler: /opt/rocm/llvm/bin/flang - skipped
-- Using hip-clang to build for amdgpu backend
CMake Warning (dev) at externals/rocsparse-src/CMakeLists.txt:72 (option):
  Policy CMP0077 is not set: option() honors normal variables.  Run "cmake
  --help-policy CMP0077" for policy details.  Use the cmake_policy command to
  set the policy and suppress this warning.

  For compatibility with older versions of CMake, option is clearing the
  normal variable 'BUILD_CLIENTS_SAMPLES'.
This warning is for project developers.  Use -Wno-dev to suppress it.

-- Found Git: /usr/bin/git (found version "2.34.1")
-- Performing Test COMPILER_HAS_TARGET_ID_gfx803
-- Performing Test COMPILER_HAS_TARGET_ID_gfx803 - Failed
-- Performing Test COMPILER_HAS_TARGET_ID_gfx900_xnack_off
-- Performing Test COMPILER_HAS_TARGET_ID_gfx900_xnack_off - Failed
-- Performing Test COMPILER_HAS_TARGET_ID_gfx906_xnack_off
-- Performing Test COMPILER_HAS_TARGET_ID_gfx906_xnack_off - Failed
-- Performing Test COMPILER_HAS_TARGET_ID_gfx908_xnack_off
-- Performing Test COMPILER_HAS_TARGET_ID_gfx908_xnack_off - Failed
-- Performing Test COMPILER_HAS_TARGET_ID_gfx90a_xnack_off
-- Performing Test COMPILER_HAS_TARGET_ID_gfx90a_xnack_off - Failed
-- Performing Test COMPILER_HAS_TARGET_ID_gfx90a_xnack_on
-- Performing Test COMPILER_HAS_TARGET_ID_gfx90a_xnack_on - Failed
-- Performing Test COMPILER_HAS_TARGET_ID_gfx1030
-- Performing Test COMPILER_HAS_TARGET_ID_gfx1030 - Failed
-- Performing Test COMPILER_HAS_TARGET_ID_gfx1100
-- Performing Test COMPILER_HAS_TARGET_ID_gfx1100 - Failed
-- Performing Test COMPILER_HAS_TARGET_ID_gfx1102
-- Performing Test COMPILER_HAS_TARGET_ID_gfx1102 - Failed
-- AMDGPU_TARGETS: gfx900;gfx906;gfx908;gfx90a;gfx1030
-- hip::amdhip64 is SHARED_LIBRARY
-- Performing Test HIP_CLANG_SUPPORTS_PARALLEL_JOBS
-- Performing Test HIP_CLANG_SUPPORTS_PARALLEL_JOBS - Failed
-- hip::amdhip64 is SHARED_LIBRARY
-- Performing Test COMPILER_HAS_HIDDEN_VISIBILITY
-- Performing Test COMPILER_HAS_HIDDEN_VISIBILITY - Success
-- Performing Test COMPILER_HAS_HIDDEN_INLINE_VISIBILITY
-- Performing Test COMPILER_HAS_HIDDEN_INLINE_VISIBILITY - Success
-- Performing Test COMPILER_HAS_DEPRECATED_ATTR
-- Performing Test COMPILER_HAS_DEPRECATED_ATTR - Success
-- Backward Compatible Sym Link Created for include directories

Notice how it is trying to add the following targets: AMDGPU_TARGETS: gfx900;gfx906;gfx908;gfx90a;gfx1030. This is exactly the problem: even when I override the AMDGPU_TARGETS variable, it still tries to add all of those targets and compilation fails.

What actually happens

What I would like to do is just specify which target I want, for example:

set(CMAKE_HIP_ARCHITECTURES gfx908)
set(AMDGPU_TARGETS ${CMAKE_HIP_ARCHITECTURES} FORCE)

Or better yet, why does it not automatically rely on CMAKE_HIP_ARCHITECTURES or CMAKE_CUDA_ARCHITECTURES?

How to reproduce

Please see the script above.

Environment

Everything is running 5.4.2 (ROCM/HIP/rocSparse).

Alternatively, can you recommend a better approach for including rocSPARSE (built from source) in our external projects? Thank you, any help here is appreciated!

rocsparse_dense2csr creates CSC matrix

What is the expected behavior

Converting a dense matrix to CSR format using rocsparse_dense2csr() should result in a matrix encoded in CSR format.

What actually happens

The resulting matrix is encoded in CSC format.

How to reproduce

Convert a dense matrix to CSR format using rocSPARSE and look at the result. I attached test code that shows this behavior. It returns the following output:

Matrix:
  1  2  3  4  5
  6  7  8  9 10
 11 12 13 14 15
 16 17 18 19 20
 21 22 23 24 25

row_ptr = 0, 5, 10, 15, 20, 25
col_ind = 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 
val = 1 6 11 16 21 2 7 12 17 22 3 8 13 18 23 4 9 14 19 24 5 10 15 20 25
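For comparison, a correct CSR encoding of this matrix would keep the same row pointers and column indices but list the values row by row:

row_ptr = 0, 5, 10, 15, 20, 25
col_ind = 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4
val = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25

The printed val array instead walks the matrix column by column, which is the layout one would get from a CSC encoding (or from the CSR encoding of the transpose).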

Environment

Hardware:
  GPU: MI100
  CPU: AMD 7763
Software versions:
  ROCM: v5.7.0

Attachments

csrcsc.zip

rocsparse_spsm compute stage synchronization

What is the expected behavior

The compute stage of rocsparse_spsm executes asynchronously, so the rocsparse_spsm call should take next to no time when stage = compute.

What actually happens

The rocsparse_spsm function with stage = compute seems to wait for previous operations in the stream to finish, and therefore takes a long time.

How to reproduce

I was testing it on LUMI-G. Load modules with

module load LUMI/22.08 rocm/5.2.3

Compile the program source.cpp.txt with

hipcc -g -O2 -fopenmp --offload-arch=gfx90a:sramecc+:xnack- source.cpp -o program.x -lrocsparse

(remove the .txt extension from the source file; GitHub did not allow me to upload a .cpp file...) and run with ./program.x

It creates an upper-triangular matrix full of ones in CSR format, copies it to the device, creates the right-hand-side and solution matrices (single-column), performs the three spsm stages and deletes everything. There are timers for the spsm stages to see how long they take to submit and execute. I also added a dummy kernel launch before the calls to highlight that in the compute stage, there is a synchronization somewhere.

The output of the reproducer program was the following for my tests:

Allocate memory
Populate matrix in host memory
Copy matrix to device memory
Set vectors
Initialize sparse handle and descriptors
Buffersize
  Submitting long kernel
  Long kernel submitted
  Submitting the buffersize
  Buffersize submitted: 0.012159 ms
  Waiting for buffersize to finish
  Buffersize finished: 3417.359114 ms
Allocate buffer
Preprocess
  Submitting long kernel
  Long kernel submitted
  Submitting the preprocess
  Preprocess submitted: 3426.814079 ms
  Waiting for preprocess to finish
  Preprocess finished: 0.009060 ms
Compute
  Submitting long kernel
  Long kernel submitted
  Submitting the compute
  Compute submitted: 3432.324886 ms
  Waiting for compute to finish
  Compute finished: 390.285015 ms
Waiting for everything to finish
Everything finished: 0.000000 ms
Destroy sparse handle and descriptors
Free memory
Program finished

Notice the Compute submitted: 3432.324886 ms line, which shows that the compute stage took almost 3.5 seconds to submit because it was waiting for the previous kernel to finish.
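For reference, a minimal sketch (not the attached reproducer) of the timing pattern described above, using plain HIP and std::chrono: a dummy kernel keeps the stream busy, wall-clock time is taken right after the allegedly asynchronous call returns (the submit time) and again after synchronizing the stream (the total time). A truly asynchronous call shows a near-zero submit time even while earlier work is still running.

#include <hip/hip_runtime.h>
#include <chrono>
#include <cstdio>

// Trivial kernel that burns time on the device so the stream is busy
// when the next operation is submitted
__global__ void long_kernel(float* data, int iters)
{
    float x = data[threadIdx.x];
    for(int i = 0; i < iters; ++i)
        x = x * 1.000001f + 1.0f;
    data[threadIdx.x] = x;
}

static double ms_since(std::chrono::steady_clock::time_point t0)
{
    return std::chrono::duration<double, std::milli>(std::chrono::steady_clock::now() - t0).count();
}

int main()
{
    hipStream_t stream;
    hipStreamCreate(&stream);

    float* d;
    hipMalloc(&d, 256 * sizeof(float));

    // Keep the stream busy, then time how long the next submission blocks
    hipLaunchKernelGGL(long_kernel, dim3(1), dim3(256), 0, stream, d, 1 << 24);

    auto t0 = std::chrono::steady_clock::now();
    // ... the call under test would go here, e.g. rocsparse_spsm(...) with
    //     the compute stage on the same stream ...
    double submit_ms = ms_since(t0);

    hipStreamSynchronize(stream); // wait for everything in the stream to finish
    double total_ms = ms_since(t0);

    std::printf("submitted in %f ms, finished after %f ms\n", submit_ms, total_ms);

    hipFree(d);
    hipStreamDestroy(stream);
    return 0;
}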

Environment

LUMI supercomputer, gpu node, SUSE Linux Enterprise Server 15 SP4, AMD EPYC 7A53 64-core, MI250X gfx90a:sramecc+:xnack-

rocm-5.2.3

Additionally, I would like to ask, does the preprocessing stage of spsm and spsv really have to be synchronous? The function itself (not including the synchronization) does not take much time per my measurements, so it is not a huge deal, but it would be nice if it was asynchronous too.

Thanks,

Jakub

memory access violation during rocsparse-test on Amethyst XT

I tried to follow the installation instructions for an AMD Tonga card on my laptop.
I successfully obtained:

$ rocminfo   | grep Name
  Name:                    Intel(R) Xeon(R) CPU E3-1505M v5 @ 2.80GHz
  Marketing Name:          Intel(R) Xeon(R) CPU E3-1505M v5 @ 2.80GHz
  Vendor Name:             CPU                                
  Name:                    gfx802                             
  Marketing Name:          Amethyst XT [Radeon R9 M295X / M390X]
  Vendor Name:             AMD                                
  Name:                    amdgcn-amd-amdhsa--gfx802          

The band width test also gives reasonable results

$ rocm-bandwidth-test | grep "Device:"
  Device: 0,  Intel(R) Xeon(R) CPU E3-1505M v5 @ 2.80GHz
  Device: 1,  Amethyst XT [Radeon R9 M295X / M390X],  01:0.0

$ rocm-bandwidth-test | tail

          Bdirectional copy peak bandwidth GB/s

          D/D       0           1           

          0         N/A         16.868657   

          1         16.868657   N/A         

After installing rocSPARSE and starting ./clients/staging/rocsparse-test, the tests run for quite a long time before finally stopping with a memory access violation:

Query device success: there are 1 devices
Device ID 0: Amethyst XT [Radeon R9 M295X / M390X]
-------------------------------------------------------------------------
with 4096MB memory, clock rate 723MHz @ computing capability 8.0
maxGridDimX 2147483647, sharedMemPerBlock 64KB, maxThreadsPerBlock 1024
wavefrontSize 64
-------------------------------------------------------------------------
Using device ID 0 (Amethyst XT [Radeon R9 M295X / M390X]) for rocSPARSE
-------------------------------------------------------------------------
rocSPARSE version: 1.19.3-916-e69bb42
[==========] Running 119628 tests from 201 test cases.
[----------] Global test environment set-up.
[----------] 192 tests from quick/axpby
[ RUN      ] quick/axpby.level1/i32_f32_r_1200_5_1_0_1_0_0b
[       OK ] quick/axpby.level1/i32_f32_r_1200_5_1_0_1_0_0b (22 ms)
[ RUN      ] quick/axpby.level1/i32_f32_r_1200_10_1_0_1_0_0b
[       OK ] quick/axpby.level1/i32_f32_r_1200_10_1_0_1_0_0b (1 ms)
......
......
[ RUN      ] pre_checkin/spmv_ell.level2/i32_f32_r_0_4441_2_0_0_0_NT_1b_rand
[       OK ] pre_checkin/spmv_ell.level2/i32_f32_r_0_4441_2_0_0_0_NT_1b_rand (1 ms)
[ RUN      ] pre_checkin/spmv_ell.level2/i32_f32_r_0_10000_2_0_0_0_NT_1b_rand
[       OK ] pre_checkin/spmv_ell.level2/i32_f32_r_0_10000_2_0_0_0_NT_1b_rand (2 ms)
[ RUN      ] pre_checkin/spmv_ell.level2/i32_f32_r_7111_0_2_0_0_0_NT_1b_rand
Segmentation fault (core dumped)

I installed the sources from yesterday's develop branch and changed CMakeLists.txt as suggested in other posts.
If there's anything I can do to help fix this issue, please let me know.

Feature request: Separate workspace buffers to persistent and temporary

As far as I can see in the implementation of, e.g., the rocsparse_Xcsrsm_buffer_size function (e.g. here), there are two kinds of memory needed in the buffer.

One part of the buffer memory is to store some helper data about the CSR matrix itself, e.g. the transposed matrix. This memory is initialized once in the analysis stage, and then just read in the solve stage. It is persistent between multiple calls to the solve stage.

The other part of the buffer memory is used in the solve stage only. As far as I can see in the function I linked, you might need additional memory of the size of the RHS matrix (for a possible transposed copy of B). This memory is only used in the solve stage as a temporary buffer; the data it contains is not needed before or after the solve stage, so the memory can be reused for other things (e.g. a solve with a different matrix).

(am I right, does it work this way?)

My request is to split the workspace into these two kinds of buffers, so that I don't have to keep the whole buffer with the huge RHS allocated for multiple CSR matrices at the same time.

In my use case, I have a set of ~100 CSR matrices with their RHSs (matrix size ~20000, ~4000 right-hand sides per matrix). csrsm buffer_size reports that it requires ~600 MB of memory, which totals ~60 GB for all matrices (just barely enough to fit on an MI250X GCD, not counting the original matrix data). I sequentially loop through all these matrices and perform the analysis phase. Then I loop through them again and perform the solve phase. This solve loop is itself inside a loop, so each matrix is used multiple times for solves (so separating the analysis and solve makes sense; I don't want to re-analyze the matrix before each solve, even though it is quite fast now). As you can see, the workspace buffers for the RHSs do not all need to be allocated at the same time; I could just allocate the largest one and use it for all solves.

(my actual use case is slightly different, but this demonstrates the main point I am trying to make)

I understand this is a very specific use case and large matrices, and it would add another boilerplate code to using rocsparse, but I think that separating the workspace buffers (in all functions, not just csrsm) might have other uses than mine.
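A rough sketch of the allocation pattern this request would enable, in plain C++ with made-up sizes and std::vector standing in for device allocations (rocSPARSE has no such split API today; the names below are hypothetical and only illustrate the bookkeeping): each matrix keeps its small persistent analysis buffer, while a single temporary buffer of the maximum required size is shared by all solves.

#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical per-matrix workspace split: the persistent part survives
// between solves, the temporary part is only needed while a solve runs
struct workspace_sizes
{
    size_t persistent; // analysis data (e.g. a transposed copy of the matrix)
    size_t temporary;  // scratch of roughly the size of the RHS matrix
};

int main()
{
    // Pretend these values came from a (hypothetical) split buffer-size
    // query, one entry per CSR matrix in the set
    std::vector<workspace_sizes> sizes = {{64u << 20, 600u << 20},
                                          {48u << 20, 550u << 20},
                                          {80u << 20, 620u << 20}};

    // Analysis phase: allocate and keep only the persistent buffers
    std::vector<std::vector<char>> persistent;
    size_t max_temporary = 0;
    for(const auto& s : sizes)
    {
        persistent.emplace_back(s.persistent); // kept alive for all later solves
        max_temporary = std::max(max_temporary, s.temporary);
    }

    // Solve phase: a single scratch buffer of the maximum size is shared,
    // instead of keeping one full-size workspace alive per matrix
    std::vector<char> scratch(max_temporary);
    for(size_t i = 0; i < sizes.size(); ++i)
    {
        // ... solve with persistent[i] and scratch ...
    }
    return 0;
}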


Edit:

So after reading through the source code of rocsparse more carefully, my memory problem can be solved by just using a transposed RHS matrix B.

Although it is an implementation detail, I think these things (like "the function will need a large workspace buffer if transB==none" or "this function is faster if transA==trans") should be mentioned in the documentation. After all, this is HPC; we care about performance and memory, and we don't want to test all possible variants of calling the function to find the fastest one.

Anyway, this moves my feature request to a very low priority for me. So feel free to close this issue after you read it :)

(and sorry for using this issue as a rubber ducky)

Build issue - 'Label field of continuation line is not blank' in Fortran files

What is the expected behavior

Build completes successfully. Renaming the .f files to .f90 fixes all issues; see pull request #226.

What actually happens

Build fails. Below you may find the output from CMake:

-- The CXX compiler identification is Clang 13.0.0
-- The Fortran compiler identification is Flang 99.99.1
-- Check for working CXX compiler: /opt/rocm/bin/hipcc
-- Check for working CXX compiler: /opt/rocm/bin/hipcc -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Check for working Fortran compiler: /opt/rocm/llvm/bin/flang
-- Check for working Fortran compiler: /opt/rocm/llvm/bin/flang  -- works
-- Detecting Fortran compiler ABI info
-- Detecting Fortran compiler ABI info - done
-- Checking whether /opt/rocm/llvm/bin/flang supports Fortran 90
-- Checking whether /opt/rocm/llvm/bin/flang supports Fortran 90 -- yes
-- Using hip-clang to build for amdgpu backend
-- Found Git: /usr/bin/git (found version "2.25.1") 
-- The C compiler identification is GNU 9.3.0
-- The CXX compiler identification is Clang 13.0.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /opt/rocm/bin/hipcc
-- Check for working CXX compiler: /opt/rocm/bin/hipcc -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring done
-- Generating done
-- Build files have been written to: /rocm-build/build/rocsparse/extern/rocm-cmake-master
-- GLOB mismatch!
-- Configuring done
-- Generating done
-- Build files have been written to: /rocm-build/build/rocsparse/extern/rocm-cmake-master
Install the project...
-- Install configuration: ""
-- Installing: /rocm-build/build/rocsparse/extern/rocm-cmake/./share
-- Installing: /rocm-build/build/rocsparse/extern/rocm-cmake/./share/rocm
-- Installing: /rocm-build/build/rocsparse/extern/rocm-cmake/./share/rocm/cmake
-- Installing: /rocm-build/build/rocsparse/extern/rocm-cmake/./share/rocm/cmake/ROCMClangTidy.cmake
-- Installing: /rocm-build/build/rocsparse/extern/rocm-cmake/./share/rocm/cmake/ROCMInstallSymlinks.cmake
-- Installing: /rocm-build/build/rocsparse/extern/rocm-cmake/./share/rocm/cmake/ROCMInstallTargets.cmake
-- Installing: /rocm-build/build/rocsparse/extern/rocm-cmake/./share/rocm/cmake/ROCMPackageConfigHelpers.cmake
-- Installing: /rocm-build/build/rocsparse/extern/rocm-cmake/./share/rocm/cmake/ROCMCppCheck.cmake
-- Installing: /rocm-build/build/rocsparse/extern/rocm-cmake/./share/rocm/cmake/ROCMCheckTargetIds.cmake
-- Installing: /rocm-build/build/rocsparse/extern/rocm-cmake/./share/rocm/cmake/ROCMChecks.cmake
-- Installing: /rocm-build/build/rocsparse/extern/rocm-cmake/./share/rocm/cmake/ROCMCreatePackage.cmake
-- Installing: /rocm-build/build/rocsparse/extern/rocm-cmake/./share/rocm/cmake/ROCMAnalyzers.cmake
-- Installing: /rocm-build/build/rocsparse/extern/rocm-cmake/./share/rocm/cmake/ROCMConfig.cmake
-- Installing: /rocm-build/build/rocsparse/extern/rocm-cmake/./share/rocm/cmake/ROCMSetupVersion.cmake
-- Installing: /rocm-build/build/rocsparse/extern/rocm-cmake/share/rocm/cmake/ROCMConfigVersion.cmake
sh: 1: lsmod: not found
sh: 1: lsmod: not found
-- Performing Test HAVE_gfx803
-- Performing Test HAVE_gfx803 - Success
-- Performing Test HAVE_gfx900:xnack-
-- Performing Test HAVE_gfx900:xnack- - Success
-- Performing Test HAVE_gfx906:xnack-
-- Performing Test HAVE_gfx906:xnack- - Success
-- Performing Test HAVE_gfx908:xnack-
-- Performing Test HAVE_gfx908:xnack- - Success
-- Performing Test HAVE_gfx90a:xnack-
-- Performing Test HAVE_gfx90a:xnack- - Success
-- Performing Test HAVE_gfx90a:xnack+
-- Performing Test HAVE_gfx90a:xnack+ - Success
-- Performing Test HAVE_gfx1030
-- Performing Test HAVE_gfx1030 - Success
sh: 1: lsmod: not found
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE  
-- ROCclr at /opt/rocm/lib/cmake/rocclr
-- hip::amdhip64 is SHARED_LIBRARY
-- Performing Test HIP_CLANG_SUPPORTS_PARALLEL_JOBS
-- Performing Test HIP_CLANG_SUPPORTS_PARALLEL_JOBS - Success
sh: 1: lsmod: not found
-- ROCclr at /opt/rocm/lib/cmake/rocclr
-- hip::amdhip64 is SHARED_LIBRARY
-- Performing Test COMPILER_HAS_HIDDEN_VISIBILITY
-- Performing Test COMPILER_HAS_HIDDEN_VISIBILITY - Success
-- Performing Test COMPILER_HAS_HIDDEN_INLINE_VISIBILITY
-- Performing Test COMPILER_HAS_HIDDEN_INLINE_VISIBILITY - Success
-- Performing Test COMPILER_HAS_DEPRECATED_ATTR
-- Performing Test COMPILER_HAS_DEPRECATED_ATTR - Success
-- Configuring done
-- Generating done
-- Build files have been written to: /rocm-build/build/rocsparse
[1/137] Building Fortran preprocessed library/CMakeFiles/rocsparse_fortran.dir/src/rocsparse_enums.f-pp.f
clang-13: warning: argument unused during compilation: '-std=f2003' [-Wunused-command-line-argument]
clang-13: warning: argument unused during compilation: '-ffree-form' [-Wunused-command-line-argument]
[2/137] Building Fortran preprocessed library/CMakeFiles/rocsparse_fortran.dir/src/rocsparse.f-pp.f
clang-13: warning: argument unused during compilation: '-std=f2003' [-Wunused-command-line-argument]
clang-13: warning: argument unused during compilation: '-ffree-form' [-Wunused-command-line-argument]
[4/137] Building Fortran object library/CMakeFiles/rocsparse_fortran.dir/src/rocsparse_enums.f.o
FAILED: library/CMakeFiles/rocsparse_fortran.dir/src/rocsparse_enums.f.o include/rocsparse_enums.mod 
/opt/rocm/llvm/bin/flang  -I//rocSPARSE/library/src -O3 -DNDEBUG -Jinclude   -std=f2003 -ffree-form -cpp -c library/CMakeFiles/rocsparse_fortran.dir/src/rocsparse_enums.f-pp.f -o library/CMakeFiles/rocsparse_fortran.dir/src/rocsparse_enums.f.o
clang-13: warning: argument unused during compilation: '-ffree-form' [-Wunused-command-line-argument]
F90-S-0021-Label field of continuation line is not blank (/rocSPARSE/library/src/rocsparse_enums.f: 24)
F90-S-0021-Label field of continuation line is not blank (/rocSPARSE/library/src/rocsparse_enums.f: 25)
F90-S-0021-Label field of continuation line is not blank (/rocSPARSE/library/src/rocsparse_enums.f: 32)
F90-S-0034-Syntax error at or near :: (/rocSPARSE/library/src/rocsparse_enums.f: 33)
F90-S-0034-Syntax error at or near :: (/rocSPARSE/library/src/rocsparse_enums.f: 34)
F90-S-0021-Label field of continuation line is not blank (/rocSPARSE/library/src/rocsparse_enums.f: 36)
F90-S-0021-Label field of continuation line is not blank (/rocSPARSE/library/src/rocsparse_enums.f: 39)
F90-S-0034-Syntax error at or near :: (/rocSPARSE/library/src/rocsparse_enums.f: 35)
F90-S-0034-Syntax error at or near :: (/rocSPARSE/library/src/rocsparse_enums.f: 40)
F90-S-0021-Label field of continuation line is not blank (/rocSPARSE/library/src/rocsparse_enums.f: 42)
F90-S-0021-Label field of continuation line is not blank (/rocSPARSE/library/src/rocsparse_enums.f: 45)
F90-S-0034-Syntax error at or near :: (/rocSPARSE/library/src/rocsparse_enums.f: 41)
F90-S-0034-Syntax error at or near :: (/rocSPARSE/library/src/rocsparse_enums.f: 46)
F90-S-0034-Syntax error at or near :: (/rocSPARSE/library/src/rocsparse_enums.f: 47)
F90-S-0034-Syntax error at or near :: (/rocSPARSE/library/src/rocsparse_enums.f: 48)
F90-S-0021-Label field of continuation line is not blank (/rocSPARSE/library/src/rocsparse_enums.f: 50)
F90-S-0021-Label field of continuation line is not blank (/rocSPARSE/library/src/rocsparse_enums.f: 53)
F90-S-0034-Syntax error at or near :: (/rocSPARSE/library/src/rocsparse_enums.f: 49)
F90-S-0034-Syntax error at or near :: (/rocSPARSE/library/src/rocsparse_enums.f: 54)
F90-S-0021-Label field of continuation line is not blank (/rocSPARSE/library/src/rocsparse_enums.f: 56)
F90-S-0021-Label field of continuation line is not blank (/rocSPARSE/library/src/rocsparse_enums.f: 59)
F90-S-0034-Syntax error at or near :: (/rocSPARSE/library/src/rocsparse_enums.f: 55)
F90-S-0034-Syntax error at or near :: (/rocSPARSE/library/src/rocsparse_enums.f: 60)
F90-S-0021-Label field of continuation line is not blank (/rocSPARSE/library/src/rocsparse_enums.f: 62)
F90-S-0021-Label field of continuation line is not blank (/rocSPARSE/library/src/rocsparse_enums.f: 65)
F90-F-0008-Error limit exceeded (/rocSPARSE/library/src/rocsparse_enums.f: 61)
F90/x86-64 Linux Flang - 1.5 2017-05-01: compilation aborted
[29/137] Building CXX object library/CMakeFiles/rocsparse.dir/src/level2/rocsparse_bsrxmv_spzl_17_32.cpp.o
ninja: build stopped: subcommand failed.

How to reproduce

I attempted to build the git develop version of rocSPARSE on Ubuntu 20.04 after installing the rocm-dev package. I can provide a Dockerfile of my environment if requested.

Environment

Hardware:
  GPU: N/A
  CPU: Ryzen 5900X
Software versions:
  ROCK: v0.0
  ROCR: v0.0
  HCC: v0.0
  Library: v0.0

Question about reporting performance of AMD MI250 Accelerator

Hello
I do not have any issue with rocSPARSE, but I do not know where else to ask...

When writing a simple program and running it on an AMD MI250 GPU, it runs on 1 out of the 2 GCDs of the MI250.
In rocm-smi I can see two "separate" GPUs, and only one of them is busy while the benchmark runs.

Should I rewrite my program to run on the two GCDs, when running a performance benchmark for this GPU, or leave it as it is, underutilizing the available hardware?

What is the direction that you (the developers of rocSPARSE) follow, when reporting performance of rocSPARSE?

Thank you.

rocsparse_spmm dense matrices have to have same order

I was having trouble using the rocsparse_spmm function: the solve stage (only the solve stage) kept returning the error rocsparse_status_invalid_value. After a while of searching, I found this if statement in the source code, which only allows dense matrices B and C with the same storage order (row-major or column-major). Okay, this might make sense, since it's another case to implement (although cuSPARSE has no problem with that). But I could not find any info about this restriction in the documentation.

Please add a note about this to the docs (and please take a look at other functions if the same problem isn't there too).


Also, as a sidenote, please improve the error reporting (across all ROCm libraries). As far as I remember, CUDA libraries print an actually helpful error message before returning the status/error code.

Error on build

CMake Error at CMakeLists.txt:58 (string):
string sub-command REGEX, mode MATCH needs at least 5 arguments total to
command.

CMake Error at CMakeLists.txt:66 (message):
'hcc' or 'hipcc' compiler required to compile for ROCm platform.

5.1.3: 5 tests failed for gfx1031, comparison error exceeds 5 * tolerance

What is the expected behavior

  • All tests pass for Radeon RX 6700 XT (gfx1031)

What actually happens

  • There are 5 errors:
The difference between std::imag(A[i + j * LDA]) and std::imag(B[i + j * LDB]) is 5.245208740234375e-06, which exceeds std::imag(compare_val), where
std::imag(A[i + j * LDA]) evaluates to -0.0008339185151271522,
std::imag(B[i + j * LDB]) evaluates to -0.00082867330638691783, and
std::imag(compare_val) evaluates to 1.1920928955078125e-06.
[  FAILED  ] nightly/gebsrmv.level2/f32_c_193482_340123_2_0_0p67_1p5_row_1_8_NT_1b_rand, where GetParam() = { function: "gebsrmv", index_type_I: "i32", index_type_J: "i32", compute_type: "f32_c", transA: "NT", transB: "NT", baseA: "1b", baseB: "0b", baseC: "0b", baseD: "0b", M: 193482, N: 340123, K: -1, nnz: -1, block_dim: 2, row_block_dimA: 1, col_block_dimA: 8, row_block_dimB: 2, col_block_dimB: 2, dim_x: 0, dim_y: 0, dim_z: 0, alpha: 2.0, alphai: 0.0, beta: 0.67000000000000004, betai: 1.5, threshold: 1.0, percentage: 0.0, action: "num", part: "auto", matrix_type: "general", diag: "ND", uplo: "L", storage: "sorted", analysis_policy: "reuse", solve_policy: "auto", direction: "row", order: "col", format: "coo", sddmm_alg: "default", spmv_alg: "default", spsv_alg: "default", spsm_alg: "default", spmm_alg: "default", spgemm_alg: "default", sparse_to_dense_alg: "default", dense_to_sparse_alg: "default", gtsv_interleaved_alg: "default", gpsv_interleaved_alg: "default", matrix: "rand", matrix_init_kind: "default", file: "*", algo: 0, numeric_boost: 0, boost_tol: 0.0, boost_val: 1.0, boost_vali: 0.0, tolm: 1.0, name: "gebsrmv", category: "nightly", unit_check: 1, timing: 0, iters: 10, denseld: -1, batch_count: -1, batch_stride: -1 }

The difference between std::real(A[i + j * LDA]) and std::real(B[i + j * LDB]) is 5.7220458984375e-06, which exceeds std::real(compare_val), where
std::real(A[i + j * LDA]) evaluates to -0.00074288249015808105,
std::real(B[i + j * LDB]) evaluates to -0.00073716044425964355, and
std::real(compare_val) evaluates to 1.1920928955078125e-06.
/ext4-disk/build-vanilla/portage/sci-libs/rocSPARSE-5.1.3/work/rocSPARSE-rocm-5.1.3/clients/common/rocsparse_check.cpp:284: Failure
The difference between std::real(A[i + j * LDA]) and std::real(B[i + j * LDB]) is 5.7220458984375e-06, which exceeds std::real(compare_val), where
std::real(A[i + j * LDA]) evaluates to -0.00074288249015808105,
std::real(B[i + j * LDB]) evaluates to -0.00073716044425964355, and
std::real(compare_val) evaluates to 1.1920928955078125e-06.
[  FAILED  ] nightly/gebsrmv.level2/f32_c_193482_340123_2_0_0p67_1p5_row_11_8_NT_1b_rand, where GetParam() = { function: "gebsrmv", index_type_I: "i32", index_type_J: "i32", compute_type: "f32_c", transA: "NT", transB: "NT", baseA: "1b", baseB: "0b", baseC: "0b", baseD: "0b", M: 193482, N: 340123, K: -1, nnz: -1, block_dim: 2, row_block_dimA: 11, col_block_dimA: 8, row_block_dimB: 2, col_block_dimB: 2, dim_x: 0, dim_y: 0, dim_z: 0, alpha: 2.0, alphai: 0.0, beta: 0.67000000000000004, betai: 1.5, threshold: 1.0, percentage: 0.0, action: "num", part: "auto", matrix_type: "general", diag: "ND", uplo: "L", storage: "sorted", analysis_policy: "reuse", solve_policy: "auto", direction: "row", order: "col", format: "coo", sddmm_alg: "default", spmv_alg: "default", spsv_alg: "default", spsm_alg: "default", spmm_alg: "default", spgemm_alg: "default", sparse_to_dense_alg: "default", dense_to_sparse_alg: "default", gtsv_interleaved_alg: "default", gpsv_interleaved_alg: "default", matrix: "rand", matrix_init_kind: "default", file: "*", algo: 0, numeric_boost: 0, boost_tol: 0.0, boost_val: 1.0, boost_vali: 0.0, tolm: 1.0, name: "gebsrmv", category: "nightly", unit_check: 1, timing: 0, iters: 10, denseld: -1, batch_count: -1, batch_stride: -1 }

The difference between std::real(A[i + j * LDA]) and std::real(B[i + j * LDB]) is 6.198883056640625e-06, which exceeds std::real(compare_val), where
std::real(A[i + j * LDA]) evaluates to -0.00057590007781982422,
std::real(B[i + j * LDB]) evaluates to -0.00056970119476318359, and
std::real(compare_val) evaluates to 1.1920928955078125e-06.
/ext4-disk/build-vanilla/portage/sci-libs/rocSPARSE-5.1.3/work/rocSPARSE-rocm-5.1.3/clients/common/rocsparse_check.cpp:284: Failure
The difference between std::real(A[i + j * LDA]) and std::real(B[i + j * LDB]) is 6.198883056640625e-06, which exceeds std::real(compare_val), where
std::real(A[i + j * LDA]) evaluates to -0.00057590007781982422,
std::real(B[i + j * LDB]) evaluates to -0.00056970119476318359, and
std::real(compare_val) evaluates to 1.1920928955078125e-06.
[  FAILED  ] nightly/gebsrmv.level2/f32_c_193482_340123_2_0_0p67_1p5_row_20_8_NT_1b_rand, where GetParam() = { function: "gebsrmv", index_type_I: "i32", index_type_J: "i32", compute_type: "f32_c", transA: "NT", transB: "NT", baseA: "1b", baseB: "0b", baseC: "0b", baseD: "0b", M: 193482, N: 340123, K: -1, nnz: -1, block_dim: 2, row_block_dimA: 20, col_block_dimA: 8, row_block_dimB: 2, col_block_dimB: 2, dim_x: 0, dim_y: 0, dim_z: 0, alpha: 2.0, alphai: 0.0, beta: 0.67000000000000004, betai: 1.5, threshold: 1.0, percentage: 0.0, action: "num", part: "auto", matrix_type: "general", diag: "ND", uplo: "L", storage: "sorted", analysis_policy: "reuse", solve_policy: "auto", direction: "row", order: "col", format: "coo", sddmm_alg: "default", spmv_alg: "default", spsv_alg: "default", spsm_alg: "default", spmm_alg: "default", spgemm_alg: "default", sparse_to_dense_alg: "default", dense_to_sparse_alg: "default", gtsv_interleaved_alg: "default", gpsv_interleaved_alg: "default", matrix: "rand", matrix_init_kind: "default", file: "*", algo: 0, numeric_boost: 0, boost_tol: 0.0, boost_val: 1.0, boost_vali: 0.0, tolm: 1.0, name: "gebsrmv", category: "nightly", unit_check: 1, timing: 0, iters: 10, denseld: -1, batch_count: -1, batch_stride: -1 }

The difference between A[i + j * LDA] and B[i + j * LDB] is 5.6624412536621094e-06, which exceeds compare_val, where
A[i + j * LDA] evaluates to -0.001365363597869873,
B[i + j * LDB] evaluates to -0.0013597011566162109, and
compare_val evaluates to 1.3653636870003538e-06.
WARNING near_check has been permissive with a tolerance multiplier equal to 3
[  FAILED  ] nightly/csrgemm.extra/f32_r_12942__3_0_n99_0_NT_NT_0b_0b_0b_0b_csr_bibd_22_8, where GetParam() = { function: "csrgemm", index_type_I: "i32", index_type_J: "i32", compute_type: "f32_r", transA: "NT", transB: "NT", baseA: "0b", baseB: "0b", baseC: "0b", baseD: "0b", M: 1, N: 12942, K: 1, nnz: -1, block_dim: 2, row_block_dimA: 2, col_block_dimA: 2, row_block_dimB: 2, col_block_dimB: 2, dim_x: 0, dim_y: 0, dim_z: 0, alpha: 3.0, alphai: 0.0, beta: -99.0, betai: 0.0, threshold: 1.0, percentage: 0.0, action: "num", part: "auto", matrix_type: "general", diag: "ND", uplo: "L", storage: "sorted", analysis_policy: "reuse", solve_policy: "auto", direction: "row", order: "col", format: "coo", sddmm_alg: "default", spmv_alg: "default", spsv_alg: "default", spsm_alg: "default", spmm_alg: "default", spgemm_alg: "default", sparse_to_dense_alg: "default", dense_to_sparse_alg: "default", gtsv_interleaved_alg: "default", gpsv_interleaved_alg: "default", matrix: "csr", matrix_init_kind: "default", file: "bibd_22_8", algo: 0, numeric_boost: 0, boost_tol: 0.0, boost_val: 1.0, boost_vali: 0.0, tolm: 1.0, name: "csrgemm_mult_file", category: "nightly", unit_check: 1, timing: 0, iters: 10, denseld: -1, batch_count: -1, batch_stride: -1 }

The difference between A[i + j * LDA] and B[i + j * LDB] is 5.9008598327636719e-06, which exceeds compare_val, where
A[i + j * LDA] evaluates to -0.001365363597869873,
B[i + j * LDB] evaluates to -0.0013594627380371094, and
compare_val evaluates to 1.3653636870003538e-06.
[  FAILED  ] nightly/csrgemm_reuse.extra/f32_r_12942__3_0_n99_0_NT_NT_0b_0b_0b_0b_csr_bibd_22_8, where GetParam() = { function: "csrgemm_reuse", index_type_I: "i32", index_type_J: "i32", compute_type: "f32_r", transA: "NT", transB: "NT", baseA: "0b", baseB: "0b", baseC: "0b", baseD: "0b", M: 1, N: 12942, K: 1, nnz: -1, block_dim: 2, row_block_dimA: 2, col_block_dimA: 2, row_block_dimB: 2, col_block_dimB: 2, dim_x: 0, dim_y: 0, dim_z: 0, alpha: 3.0, alphai: 0.0, beta: -99.0, betai: 0.0, threshold: 1.0, percentage: 0.0, action: "num", part: "auto", matrix_type: "general", diag: "ND", uplo: "L", storage: "sorted", analysis_policy: "reuse", solve_policy: "auto", direction: "row", order: "col", format: "coo", sddmm_alg: "default", spmv_alg: "default", spsv_alg: "default", spsm_alg: "default", spmm_alg: "default", spgemm_alg: "default", sparse_to_dense_alg: "default", dense_to_sparse_alg: "default", gtsv_interleaved_alg: "default", gpsv_interleaved_alg: "default", matrix: "csr", matrix_init_kind: "default", file: "bibd_22_8", algo: 0, numeric_boost: 0, boost_tol: 0.0, boost_val: 1.0, boost_vali: 0.0, tolm: 1.0, name: "csrgemm_reuse_mult_file", category: "nightly", unit_check: 1, timing: 0, iters: 10, denseld: -1, batch_count: -1, batch_stride: -1 }

How to reproduce

Compile rocSPARSE with -DAMDGPU_TARGETS=gfx1031 and run rocsparse-test on a Radeon RX 6700 XT.

Environment

I am packaging rocSPARSE-rocm-5.1.3 for Gentoo; the ebuild is located at https://github.com/littlewu2508/gentoo/blob/rocm-5.1.3/sci-libs/rocSPARSE/rocSPARSE-5.1.3.ebuild.

Hardware description
GPU AMD Radeon RX 6700 XT
CPU AMD Ryzen 9 5950X 16-Core Processor
Software version
ROCK Linux 5.18.10
ROCR rocm-5.1.3
LLVM/Clang 14.0.6
HIP rocm-5.1.3
rocPRIM rocm-5.1.3, all tests passed
rocSPARSE rocm-5.1.3

Loss becomes NaN using tensorflow-rocm (gfx803)

What is the expected behavior

Fitting on the GPU must give a coherent result, similar to the result on the CPU.

What actually happens

The loss value becomes NaN during fitting, and after that an InvalidArgumentError is raised.

Epoch 56/150
 1/10 [==>...........................] - ETA: 0s - loss: 0.1562 - mse: nanTraceback (most recent call last):
File "testGPU.py", line 19, in <module>
    model.fit(X, Y, batch_size=batch_size, epochs=epochs)
  File "/home/arnaud/.local/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 108, in _method_wrapper
    return method(self, *args, **kwargs)
  File "/home/arnaud/.local/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1098, in fit
    tmp_logs = train_function(iterator)
  File "/home/arnaud/.local/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 780, in __call__
    result = self._call(*args, **kwds)
  File "/home/arnaud/.local/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 807, in _call
    return self._stateless_fn(*args, **kwds)  # pylint: disable=not-callable
  File "/home/arnaud/.local/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2829, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/home/arnaud/.local/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1843, in _filtered_call
    return self._call_flat(
  File "/home/arnaud/.local/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1923, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "/home/arnaud/.local/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 545, in call
    outputs = execute.execute(
  File "/home/arnaud/.local/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError:  Input to reshape is a tensor with 12 values, but the requested shape has 1175549453906477056
	 [[node gradient_tape/mean_squared_error/Reshape (defined at testGPU.py:19) ]] [Op:__inference_train_function_508]

How to reproduce

The code below works on CPU but fails on GPU.

import tensorflow as tf
from tensorflow.keras.layers import Dense

# Constant integer-valued inputs and targets
X = tf.fill((300, 5), 5)
Y = tf.fill((300, 5), 10)

model = tf.keras.Sequential()
model.add(Dense(5))
model.compile(optimizer='adam', loss='mse', metrics=['mse'])

epochs = 150
batch_size = 32

print("Fit...")
# with tf.device('/CPU:0'):  # uncomment (and indent the fit call) to run on CPU instead of GPU
model.fit(X, Y, batch_size=batch_size, epochs=epochs)

print("Evaluate...")
model.evaluate(X, Y)

print("Predict...")
result = model.predict(tf.fill((1, 5), 5))
print(result)

Environment

Hardware description
GPU gfx803 (Fiji [Radeon R9 FURY / NANO Series])
CPU AMD Ryzen 5 3500X 6-Core Processor

I am sorry, I don't know how to check versions, but all should be the latest. I installed ROCm 3.9.1.

Software version
ROCK v0.0
ROCR v0.0
HCC v0.0
Library v0.0

I installed ROCm on Ubuntu 20.04.1 as explained on the page below:
https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html

Then, I followed the instructions on the page below:
https://github.com/xuhuisheng/rocm-build/blob/develop/docs/gfx803.md

rocsparse_dbsrmv undefined on CentOS 7.7

Howdy,

I am developing AcuSolve GPU support on AMD cards and am converting my CUDA code to ROCm/HIP code. However, in the version of ROCm I have installed for CentOS 7.7, the BSR format is not available: while it appears on the documentation page, it is not defined in the header. I am wondering whether ROCm will be newer on CentOS 8 and whether the functions will be included there. Thanks!

Below is the version info:

HIP version: 3.3.20126-2dbba46b
HCC clang version 10.0.0 (/data/jenkins_workspace/centos_pipeline_job_3.3/rocm-rel-3.3/rocm-3.3-19-20200328/centos/external/hcc-tot/llvm-project/clang 1ce0fe5e88b2124494b9500817b4c2c66bdfa5aa) (based on HCC 3.1.20114-6776c83f-1ce0fe5e88b )
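
For reference, here is a minimal sketch (hypothetical, not from the original report) of how an application can print the rocSPARSE version it actually compiles and links against, which helps confirm whether the installed release can be expected to ship the BSR routines. It assumes the older <rocsparse.h> include path used by ROCm 3.x packages; newer releases install the header as <rocsparse/rocsparse.h>.

#include <rocsparse.h> // assumption: ROCm 3.x header layout; newer ROCm uses <rocsparse/rocsparse.h>
#include <iostream>

int main()
{
    rocsparse_handle handle;
    rocsparse_create_handle(&handle);

    // Query the version of the rocSPARSE library loaded at runtime.
    rocsparse_int version;
    rocsparse_get_version(handle, &version);

    // The value is a packed integer (major/minor/patch); printing it is
    // enough to confirm which release the program really uses.
    std::cout << "rocSPARSE version: " << version << std::endl;

    rocsparse_destroy_handle(handle);
    return 0;
}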

reinstalling rocsparse from scratch

What is the expected behavior

What actually happens

How to reproduce

Environment

Hardware description
GPU device string
CPU device string
Software version
ROCK v0.0
ROCR v0.0
HCC v0.0
Library v0.0

paolo@fastmmw:/FastMM/Epyc$ git clone https://github.com/ROCmSoftwarePlatform/rocSPARSE.git
Cloning into 'rocSPARSE'...
remote: Enumerating objects: 10498, done.
remote: Counting objects: 100% (1427/1427), done.
remote: Compressing objects: 100% (352/352), done.
remote: Total 10498 (delta 1153), reused 1341 (delta 1074), pack-reused 9071
Receiving objects: 100% (10498/10498), 4.82 MiB | 14.92 MiB/s, done.
Resolving deltas: 100% (8377/8377), done.
paolo@fastmmw:~/FastMM/Epyc$ cd rocSPARSE/
paolo@fastmmw:~/FastMM/Epyc/rocSPARSE$ ./install.sh -dci
Creating project build directory in: ./build
Hit:1 http://dl.google.com/linux/chrome/deb stable InRelease
Hit:2 http://us.archive.ubuntu.com/ubuntu focal InRelease
Hit:3 http://security.ubuntu.com/ubuntu focal-security InRelease
Hit:4 http://us.archive.ubuntu.com/ubuntu focal-updates InRelease
Hit:5 http://us.archive.ubuntu.com/ubuntu focal-backports InRelease
Hit:6 https://repo.radeon.com/rocm/apt/debian xenial InRelease
Reading package lists... Done
Building dependency tree
Reading state information... Done
66 packages can be upgraded. Run 'apt list --upgradable' to see them.
~/FastMM/Epyc/rocSPARSE ~/FastMM/Epyc/rocSPARSE
Building googletest from source; installing into /usr/local
-- Configuring gtest external dependency
-- ExternalGmock using ( 17 ) cores to build with
CMake Deprecation Warning at CMakeLists.txt:63 (cmake_policy):
The OLD behavior for policy CMP0037 will be removed from a future version
of CMake.

The cmake-policies(7) manual explains that the OLD behaviors of all
policies are deprecated and that a policy should be set to OLD only under
specific short-term circumstances. Projects should be ported to the NEW
behavior and not rely on setting a policy to OLD.

-- Configuring done
-- Generating done
-- Build files have been written to: /home/paolo/FastMM/Epyc/rocSPARSE/build/deps
Scanning dependencies of target googletest
[ 12%] Creating directories for 'googletest'
[ 25%] Performing download step (git clone) for 'googletest'
Cloning into 'googletest'...
Note: switching to 'release-1.10.0'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

git switch -c <new-branch-name>

Or undo this operation with:

git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 703bd9ca Googletest export
[ 37%] No patch step for 'googletest'
[ 50%] Performing update step for 'googletest'
[ 62%] Performing configure step for 'googletest'
-- The C compiler identification is GNU 9.3.0
-- The CXX compiler identification is GNU 9.3.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found PythonInterp: /usr/bin/python (found version "2.7.18")
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Configuring done
-- Generating done
-- Build files have been written to: /home/paolo/FastMM/Epyc/rocSPARSE/build/deps/gtest/src/googletest-build
[ 75%] Performing build step for 'googletest'
-- googletest build command succeeded. See also /home/paolo/FastMM/Epyc/rocSPARSE/build/deps/gtest/src/googletest-stamp/googletest-build-.log
[ 87%] No install step for 'googletest'
[100%] Completed 'googletest'
[100%] Built target googletest
[ 12%] Performing update step for 'googletest'
[ 25%] Performing configure step for 'googletest'
-- Configuring done
-- Generating done
-- Build files have been written to: /home/paolo/FastMM/Epyc/rocSPARSE/build/deps/gtest/src/googletest-build
[ 37%] Performing build step for 'googletest'
-- googletest build command succeeded. See also /home/paolo/FastMM/Epyc/rocSPARSE/build/deps/gtest/src/googletest-stamp/googletest-build-.log
[ 50%] No install step for 'googletest'
[ 62%] Completed 'googletest'
[100%] Built target googletest
Scanning dependencies of target install
[ 25%] Built target gtest
[ 50%] Built target gmock
[ 75%] Built target gmock_main
[100%] Built target gtest_main
Install the project...
-- Install configuration: ""
-- Up-to-date: /usr/local/include
-- Up-to-date: /usr/local/include/gmock
-- Installing: /usr/local/include/gmock/gmock-matchers.h
-- Installing: /usr/local/include/gmock/gmock-generated-actions.h.pump
-- Installing: /usr/local/include/gmock/gmock-cardinalities.h
-- Installing: /usr/local/include/gmock/gmock-generated-actions.h
-- Up-to-date: /usr/local/include/gmock/internal
-- Up-to-date: /usr/local/include/gmock/internal/custom
-- Installing: /usr/local/include/gmock/internal/custom/gmock-matchers.h
-- Installing: /usr/local/include/gmock/internal/custom/gmock-generated-actions.h.pump
-- Installing: /usr/local/include/gmock/internal/custom/gmock-generated-actions.h
-- Installing: /usr/local/include/gmock/internal/custom/gmock-port.h
-- Installing: /usr/local/include/gmock/internal/custom/README.md
-- Installing: /usr/local/include/gmock/internal/gmock-internal-utils.h
-- Installing: /usr/local/include/gmock/internal/gmock-port.h
-- Installing: /usr/local/include/gmock/internal/gmock-pp.h
-- Installing: /usr/local/include/gmock/gmock-generated-matchers.h
-- Installing: /usr/local/include/gmock/gmock-function-mocker.h
-- Installing: /usr/local/include/gmock/gmock.h
-- Installing: /usr/local/include/gmock/gmock-nice-strict.h
-- Installing: /usr/local/include/gmock/gmock-more-actions.h
-- Installing: /usr/local/include/gmock/gmock-generated-function-mockers.h
-- Installing: /usr/local/include/gmock/gmock-more-matchers.h
-- Installing: /usr/local/include/gmock/gmock-generated-function-mockers.h.pump
-- Installing: /usr/local/include/gmock/gmock-spec-builders.h
-- Installing: /usr/local/include/gmock/gmock-generated-matchers.h.pump
-- Installing: /usr/local/include/gmock/gmock-actions.h
-- Installing: /usr/local/lib/libgmock.a
-- Installing: /usr/local/lib/libgmock_main.a
-- Installing: /usr/local/lib/pkgconfig/gmock.pc
-- Installing: /usr/local/lib/pkgconfig/gmock_main.pc
-- Installing: /usr/local/lib/cmake/GTest/GTestTargets.cmake
-- Installing: /usr/local/lib/cmake/GTest/GTestTargets-noconfig.cmake
-- Installing: /usr/local/lib/cmake/GTest/GTestConfigVersion.cmake
-- Installing: /usr/local/lib/cmake/GTest/GTestConfig.cmake
-- Up-to-date: /usr/local/include
-- Up-to-date: /usr/local/include/gtest
-- Installing: /usr/local/include/gtest/gtest-typed-test.h
-- Installing: /usr/local/include/gtest/gtest-test-part.h
-- Up-to-date: /usr/local/include/gtest/internal
-- Installing: /usr/local/include/gtest/internal/gtest-type-util.h
-- Installing: /usr/local/include/gtest/internal/gtest-port.h
-- Up-to-date: /usr/local/include/gtest/internal/custom
-- Installing: /usr/local/include/gtest/internal/custom/gtest-port.h
-- Installing: /usr/local/include/gtest/internal/custom/gtest.h
-- Installing: /usr/local/include/gtest/internal/custom/README.md
-- Installing: /usr/local/include/gtest/internal/custom/gtest-printers.h
-- Installing: /usr/local/include/gtest/internal/gtest-port-arch.h
-- Installing: /usr/local/include/gtest/internal/gtest-death-test-internal.h
-- Installing: /usr/local/include/gtest/internal/gtest-type-util.h.pump
-- Installing: /usr/local/include/gtest/internal/gtest-internal.h
-- Installing: /usr/local/include/gtest/internal/gtest-param-util.h
-- Installing: /usr/local/include/gtest/internal/gtest-filepath.h
-- Installing: /usr/local/include/gtest/internal/gtest-string.h
-- Installing: /usr/local/include/gtest/gtest.h
-- Installing: /usr/local/include/gtest/gtest-message.h
-- Installing: /usr/local/include/gtest/gtest-death-test.h
-- Installing: /usr/local/include/gtest/gtest-spi.h
-- Installing: /usr/local/include/gtest/gtest_prod.h
-- Installing: /usr/local/include/gtest/gtest-matchers.h
-- Installing: /usr/local/include/gtest/gtest-printers.h
-- Installing: /usr/local/include/gtest/gtest-param-test.h
-- Installing: /usr/local/include/gtest/gtest_pred_impl.h
-- Installing: /usr/local/lib/libgtest.a
-- Installing: /usr/local/lib/libgtest_main.a
-- Installing: /usr/local/lib/pkgconfig/gtest.pc
-- Installing: /usr/local/lib/pkgconfig/gtest_main.pc
[100%] Built target install
~/FastMM/Epyc/rocSPARSE
~/FastMM/Epyc/rocSPARSE ~/FastMM/Epyc/rocSPARSE
-- The CXX compiler identification is unknown
-- The Fortran compiler identification is GNU 9.3.0
-- Check for working CXX compiler: /opt/rocm/bin/hipcc
-- Check for working CXX compiler: /opt/rocm/bin/hipcc -- broken
CMake Error at /usr/share/cmake-3.16/Modules/CMakeTestCXXCompiler.cmake:53 (message):
The C++ compiler

"/opt/rocm/bin/hipcc"

is not able to compile a simple test program.

It fails with the following output:

Change Dir: /home/paolo/FastMM/Epyc/rocSPARSE/build/release/CMakeFiles/CMakeTmp

Run Build Command(s):/usr/bin/make cmTC_e5a81/fast && /usr/bin/make -f CMakeFiles/cmTC_e5a81.dir/build.make CMakeFiles/cmTC_e5a81.dir/build
make[1]: Entering directory '/home/paolo/FastMM/Epyc/rocSPARSE/build/release/CMakeFiles/CMakeTmp'
Building CXX object CMakeFiles/cmTC_e5a81.dir/testCXXCompiler.cxx.o
/opt/rocm/bin/hipcc     -o CMakeFiles/cmTC_e5a81.dir/testCXXCompiler.cxx.o -c /home/paolo/FastMM/Epyc/rocSPARSE/build/release/CMakeFiles/CMakeTmp/testCXXCompiler.cxx
sh: 1: /opt/rocm/llvm/bin/clang: not found
Use of uninitialized value $HIP_CLANG_VERSION in concatenation (.) or string at /opt/rocm/bin/hipcc line 162.
Can't exec "/opt/rocm/bin/rocm_agent_enumerator": No such file or directory at /opt/rocm/bin/hipcc line 592.
Use of uninitialized value $targetsStr in substitution (s///) at /opt/rocm/bin/hipcc line 593.
Use of uninitialized value $targetsStr in split at /opt/rocm/bin/hipcc line 599.
sh: 1: /opt/rocm/llvm/bin/clang: not found
make[1]: *** [CMakeFiles/cmTC_e5a81.dir/build.make:66: CMakeFiles/cmTC_e5a81.dir/testCXXCompiler.cxx.o] Error 127
make[1]: Leaving directory '/home/paolo/FastMM/Epyc/rocSPARSE/build/release/CMakeFiles/CMakeTmp'
make: *** [Makefile:121: cmTC_e5a81/fast] Error 2

CMake will not be able to correctly generate this project.
Call Stack (most recent call first):
CMakeLists.txt:54 (project)

-- Configuring incomplete, errors occurred!
See also "/home/paolo/FastMM/Epyc/rocSPARSE/build/release/CMakeFiles/CMakeOutput.log".
See also "/home/paolo/FastMM/Epyc/rocSPARSE/build/release/CMakeFiles/CMakeError.log".
paolo@fastmmw:~/FastMM/Epyc/rocSPARSE$ git branch

* develop
paolo@fastmmw:~

Feature requests and useful instructions

I adapted rocSPARSE (version rocSPARSE-rocm-5.4.1) to work with NAVI22 gfx1031 on RHEL 9.1, i.e. AlmaLinux 9.1.

I know this architecture is not supported, but implementing it as a target worked fine.

Feature requests probably relevant for other people as well:

1: Add AlmaLinux and Rocky Linux to the install script. These are the most used RHEL bit-for-bit clones. I added "almalinux" and "AlmaLinux" and it worked fine.

2: Please mention the install script option -a gfx1031 (in my case) for compiling the source for a specific architecture, to speed up the process in such cases.

3: The tests ran for almost 3 hours; you may want to mention that next to the instructions so people know what to expect. A percentage of completion while the tests run would also be nice for people compiling the library from source.

Two out of the 283609 tests failed, maybe that's relevant for you:

[----------] Global test environment tear-down
[==========] 283609 tests from 331 test suites ran. (10227631 ms total)
[ PASSED ] 283607 tests.
[ FAILED ] 2 tests, listed below:
[ FAILED ] nightly/csritilu0.precond/f64_c_505193_0b_4_rand, where GetParam() = { function: "csritilu0", index_type_I: "i32", index_type_J: "i32", compute_type: "f64_c", transA: "NT", transB: "NT", baseA: "0b", baseB: "0b", baseC: "0b", baseD: "0b", M: 505193, N: 505193, K: -1, nnz: -1, block_dim: 2, row_block_dimA: 2, col_block_dimA: 2, row_block_dimB: 2, col_block_dimB: 2, dim_x: 1, dim_y: 1, dim_z: 1, ll: -2, l: -1, u: 1, uu: 2, alpha: 1.0, alphai: 0.0, beta: 0.0, betai: 0.0, threshold: 1.0, percentage: 0.0, action: "num", part: "auto", matrix_type: "general", diag: "ND", uplo: "L", storage: "sorted", analysis_policy: "reuse", solve_policy: "auto", direction: "row", order: "col", format: "coo", itilu0_alg: "sync_split_fusion", sddmm_alg: "default", spmv_alg: "default", spsv_alg: "default", spsm_alg: "default", spmm_alg: "default", spgemm_alg: "default", sparse_to_dense_alg: "default", dense_to_sparse_alg: "default", gtsv_interleaved_alg: "default", gpsv_interleaved_alg: "default", matrix: "rand", matrix_init_kind: "default", file: "*", algo: 0, numeric_boost: 0, boost_tol: 0.0, boost_val: 1.0, boost_vali: 0.0, tolm: 1.0, name: "csritilu0", category: "nightly", hardware: "all", unit_check: 1, timing: 0, iters: 10, denseld: -1, batch_count: -1, batch_count_A: -1, batch_count_B: -1, batch_count_C: -1, batch_stride: -1 }

[ FAILED ] nightly/csritilu0.precond/f64_c_505193_1b_4_rand, where GetParam() = { function: "csritilu0", index_type_I: "i32", index_type_J: "i32", compute_type: "f64_c", transA: "NT", transB: "NT", baseA: "1b", baseB: "0b", baseC: "0b", baseD: "0b", M: 505193, N: 505193, K: -1, nnz: -1, block_dim: 2, row_block_dimA: 2, col_block_dimA: 2, row_block_dimB: 2, col_block_dimB: 2, dim_x: 1, dim_y: 1, dim_z: 1, ll: -2, l: -1, u: 1, uu: 2, alpha: 1.0, alphai: 0.0, beta: 0.0, betai: 0.0, threshold: 1.0, percentage: 0.0, action: "num", part: "auto", matrix_type: "general", diag: "ND", uplo: "L", storage: "sorted", analysis_policy: "reuse", solve_policy: "auto", direction: "row", order: "col", format: "coo", itilu0_alg: "sync_split_fusion", sddmm_alg: "default", spmv_alg: "default", spsv_alg: "default", spsm_alg: "default", spmm_alg: "default", spgemm_alg: "default", sparse_to_dense_alg: "default", dense_to_sparse_alg: "default", gtsv_interleaved_alg: "default", gpsv_interleaved_alg: "default", matrix: "rand", matrix_init_kind: "default", file: "*", algo: 0, numeric_boost: 0, boost_tol: 0.0, boost_val: 1.0, boost_vali: 0.0, tolm: 1.0, name: "csritilu0", category: "nightly", hardware: "all", unit_check: 1, timing: 0, iters: 10, denseld: -1, batch_count: -1, batch_count_A: -1, batch_count_B: -1, batch_count_C: -1, batch_stride: -1 }

2 FAILED TESTS

Environment

Hardware description
GPU gfx1031
CPU 5900HX
Software version
ROCK v rocm-5.4.1
ROCR v rocm-5.4.1
HCC v rocm-5.4.1
Library v rocSPARSE-rocm-5.4.1

csrmv: Incorrect results for m x 0 matrices

What is the expected behavior

Multiplying an m x 0 matrix with a vector of length 0 should result in the output vector being zeroed (beta is 0 in the reproducer below).

What actually happens

The output vector is not modified

How to reproduce

#include <hipsparse.h>
#include <iomanip>
#include <iostream>

int main() {
  double *mem;
  int *zeros;
  double alpha{1.0};
  double beta{0.0};
  hipsparseHandle_t handle;
  hipMalloc(&zeros, sizeof(int) * 2);
  hipMemset(zeros, 0, sizeof(int) * 2);
  hipMalloc(&mem, sizeof(double) * 3);
  hipMemset(mem, 0xCD, sizeof(double) * 3);
  hipsparseCreate(&handle);
  hipsparseSetPointerMode(handle, HIPSPARSE_POINTER_MODE_HOST);
  hipsparseMatDescr_t descr;
  hipsparseCreateMatDescr(&descr);
  hipsparseDcsrmv(handle, HIPSPARSE_OPERATION_NON_TRANSPOSE, 1, 1, 0, &alpha,
                  descr, nullptr, zeros, nullptr, mem, &beta, mem + 1);
  hipsparseDcsrmv(handle, HIPSPARSE_OPERATION_NON_TRANSPOSE, 1, 0, 0, &alpha,
                  descr, nullptr, zeros, nullptr, nullptr, &beta, mem + 2);
  long result{};
  hipMemcpy(&result, mem + 1, sizeof(long), hipMemcpyDeviceToHost);
  std::cout << std::hex << result << std::endl;
  hipMemcpy(&result, mem + 2, sizeof(long), hipMemcpyDeviceToHost);
  std::cout << std::hex << result << std::endl;
  hipsparseDestroyMatDescr(descr);
  hipsparseDestroy(handle);
  hipFree(zeros);
  hipFree(mem);
}

Both function calls (multiplying a vector of length 1 with a zero 1x1 matrix and multiplying a vector of length 0 with a 1x0 matrix) should compute the same result, but they result in -0 and the initial value, respectively.

Environment

Hardware description
GPU MI100
Software version
ROCm 5.0.0

rocsparse_spsm doesn't return the correct result with transb = rocsparse_operation_transpose

What is the expected behavior

  • rocsparse_spsm should return the correct result of op(A) * x = transpose(B).

What actually happens

  • A wrong solution is returned.

How to reproduce

  • Solve any triangular system with multiple right-hand sides with transb = rocsparse_operation_transpose.
  • We also have a reproducer in the Julia interface AMDGPU.jl.

Environment

  • HIP Runtime v5.6.31062
  • rocBLAS v3.0.0
  • rocSOLVER v3.22.0
  • rocALUTION (the version with rocm-5.6.1)
  • rocSPARSE (the version with rocm-5.6.1)
  • rocRAND v2.10.5
  • rocFFT v1.0.21
  • MIOpen v2.20.0
  • HIPDevice(name="AMD Radeon RX 6700 XT", id=1, gcn_arch=gfx1030)

rocsparse_spsv not implemented for triangular matrices

What is the expected behavior

The docs state that "rocsparse_spsv_solve solves a sparse triangular linear system ...". Calling rocsparse_spsv with a triangular sparse matrix (that is, a rocsparse_spmat_descr where I set the attribute rocsparse_spmat_matrix_type to rocsparse_matrix_type_triangular) should work without issues and not return an error.

What actually happens

The function exits with error 2, corresponding to rocsparse_status_not_implemented.

How to reproduce

Probably not needed; I will provide a simple example if necessary.

Environment

all

The most probable cause is e.g. this line of code, which makes the function accept only general matrices; triangular matrices fail with the error. This is present in other functions too.

When hipifying (or roc-ifying) from cuSPARSE, where the matrix_type and storage_mode attributes do not exist at all, it might make sense to also accept general matrices. But it is odd that only general matrices are accepted. When working just with rocSPARSE, this behaviour is very unexpected. Why does the matrix_type attribute even exist if using it breaks the code?
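
For concreteness, the attribute setup that triggers the error is roughly the following sketch. It assumes an already created CSR descriptor (rocsparse_spmat_descr); the helper name set_triangular_attributes is made up for illustration, and the attribute/enum names are the ones mentioned above plus the usual fill-mode/diag-type attributes.

#include <rocsparse/rocsparse.h> // assumption: ROCm 5.x header layout; older releases use <rocsparse.h>

// Hypothetical helper (illustration only): mark an existing generic CSR
// descriptor as a lower-triangular, non-unit-diagonal matrix before spsv.
static rocsparse_status set_triangular_attributes(rocsparse_spmat_descr A)
{
    rocsparse_matrix_type mtype = rocsparse_matrix_type_triangular;
    rocsparse_fill_mode   fill  = rocsparse_fill_mode_lower;
    rocsparse_diag_type   diag  = rocsparse_diag_type_non_unit;

    // Per this report, setting rocsparse_spmat_matrix_type to "triangular"
    // is what makes the subsequent rocsparse_spsv call return
    // rocsparse_status_not_implemented (2).
    rocsparse_status status = rocsparse_spmat_set_attribute(
        A, rocsparse_spmat_matrix_type, &mtype, sizeof(mtype));
    if(status != rocsparse_status_success)
        return status;

    status = rocsparse_spmat_set_attribute(
        A, rocsparse_spmat_fill_mode, &fill, sizeof(fill));
    if(status != rocsparse_status_success)
        return status;

    return rocsparse_spmat_set_attribute(
        A, rocsparse_spmat_diag_type, &diag, sizeof(diag));
}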

rocsparse fails to build in Manjaro with default setup

What is the expected behavior

  • rocsparse builds using default pamac install rocm-libs

What actually happens

[ 68%] Building CXX object library/CMakeFiles/rocsparse.dir/src/conversion/rocsparse_nnz_compress.cpp.o
fatal error: error in backend: Cannot select: 0x55d580125888: i64 = FrameIndex<0>
In function: _Z41csrgemm_fill_block_per_row_device_pointerI21rocsparse_complex_numIdELj128ELj16ELj256ELj137EEviPKiS3_PKT_S3_S3_S6_S3_S3_S6_S6_S3_S3_S6_S3_PiPS4_21rocsparse_index_base_S9_S9_S9_bb
clang-11: error: clang frontend command failed with exit code 70 (use -v to see invocation)
clang version 11.0.0 (https://aur.archlinux.org/llvm-amdgpu.git 38fffda570a017745da11c40ed8991c3c70a3281)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm/llvm/bin

How to reproduce

  • pamac install rocm-libs

Environment

Hardware description
GPU Radeon 570X
CPU Ryzen 2700X
Description Manjaro Linux
Release 20.1
Codename Mikah

Software version
ROCK v0.0
ROCR v0.0
HCC v0.0
Library v0.0

errors in second call to rocsparse_sps{v,m}

Hello,

I have Cholesky factors of a matrix $A=U^\top U$. Now I want to solve the system $Az=x$, which is equivalent to first solving $U^\top y = x$ and then solving $Uz=y$. Implementing this in rocSPARSE, I am getting errors in the compute stage of the second system (both when the right-hand side is a vector and when it is a matrix, but with different errors).

What is the expected behavior

The attached program should not fail.

What actually happens

In the compute stage of the second system, the program segfaults in the case of rocsparse_spsm, and in the case of rocsparse_spsv, rocsparse_status = 3 (probably rocsparse_status_invalid_pointer) is returned from the function.

How to reproduce

See source.hip.cpp.txt (remove the .txt extension; for some reason .cpp files are not allowed here). There I allocate device memory for the sparse matrix $U$ and the vectors $x$, $y$, $z$. I fill the sparse matrix data such that the matrix represents an identity matrix (for simplicity) and set the vectors to zero. I create the rocsparse_handle and the matrix descriptor and set the matrix attributes. Then, based on the command-line argument (V or M), I perform the solution of the two systems with either a right-hand-side vector or a right-hand-side matrix (using rocsparse_spsv or rocsparse_spsm, respectively). These two systems (with $U^\top$ and $U$) and the code are identical except for the transpose parameter and the buffer used. To finally solve each system, I query the buffer size, allocate the buffer, and run the preprocess and compute stages. In the end, I destroy and free everything.

If I use a separate matrix descriptor for the second system, the program works fine. If the transpose parameters in the two systems are the same, the program works fine. Running only the second system (commenting out the first), the program works fine.

Compile using hipcc -g -O2 --offload-arch=gfx90a:sramecc+:xnack- source.hip.cpp -o program.x -lrocsparse. Run with ./program.x V for spsv and ./program.x M for spsm.

Environment

I use the LUMI supercomputer: an accelerated compute node with an MI250X GPU (I use 1/8 of the node, that is, a single GPU die), with module load LUMI/22.12 rocm/5.2.3. I also tested with rocm/5.5.1 on an MI100, and the problems are there too.

Thanks in advance for taking a look at this. From my side it looks like a bug in rocSPARSE. In case I missed an important detail or you need more info, please ask.

Jakub

[Build Error] [CMake] Cannot find hip

What is the expected behavior

  • Including rocsparse in a cmake project like find_package(rocsparse) should be successful.

What actually happens

  • When using find_package(rocsparse), it fails since it cannot find hip, even though hip is installed.
CMake Error at /usr/share/cmake-3.18/Modules/CMakeFindDependencyMacro.cmake:47 (find_package):
  By not providing "Findhip.cmake" in CMAKE_MODULE_PATH this project has
  asked CMake to find a package configuration file provided by "hip", but
  CMake did not find one.

  Could not find a package configuration file provided by "hip" with any of
  the following names:

    hipConfig.cmake
    hip-config.cmake

  Add the installation prefix of "hip" to CMAKE_PREFIX_PATH or set "hip_DIR"
  to a directory containing one of the above files.  If "hip" provides a
  separate development package or SDK, be sure it has been installed.
Call Stack (most recent call first):
  /opt/rocm/rocsparse/lib/cmake/rocsparse/rocsparse-config.cmake:89 (find_dependency)
  cmake/public/LoadHIP.cmake:131 (find_package)
  cmake/public/LoadHIP.cmake:169 (find_package_and_print_version)
  cmake/Dependencies.cmake:995 (include)
  CMakeLists.txt:411 (include)

FindHIP.cmake is installed on the system but Findhip.cmake is not.

How to reproduce

  • Call find_package(rocsparse) in a cmake project.

Environment

Hardware description
GPU device string
CPU device string
Software version
ROCK v3.5.1
ROCR v3.5.0
HCC N/A
Library v3.5.0

Downstream issue: pytorch/pytorch#41886

Looking for sparseSpMV API example

Hi,

I am looking for a simple sparse SpMV API example similar to the cuSPARSE example below:

https://github.com/NVIDIA/CUDALibrarySamples/blob/master/cuSPARSE/spmv_csr/spmv_csr_example.c

As you can see, the above example is very straightforward: it just creates a sparse matrix (cusparseCreateCsr), input and output vectors (cusparseCreateDnVec), and then uses the library function cusparseSpMV to perform SpMV.

I am using AMD's MI25 with ROCm 4.2. I have looked at both the rocSPARSE repo and the hipSPARSE repo, but did not find any such SpMV example similar to the CUDA example above. So I have the following questions; please help if possible:

i) Should I use hipSPARSE or rocSPARSE? I am thinking of using rocSPARSE because it seems lower level; also, I only intend to run on the MI25, so cuSPARSE support at the backend (which hipSPARSE apparently provides, as it is higher level) is not a requirement for me.

ii) Do you have any sparse SpMV example similar to the CUDA example above? If an example is available using either hipSPARSE or rocSPARSE, I would probably use that (see the sketch further below).

iii) Do I need to install rocSPARSE (or hipSPARSE) on top of ROCm 4.2, or does ROCm 4.2 support the sparse functions natively? For CUDA, if I install the latest CUDA I am not required to install the cuSPARSE library separately, as the latest CUDA supports the sparse SpMV operation natively. Is this the case with ROCm 4.2, or must I install rocSPARSE (or hipSPARSE) separately?

Thanks for any help,
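
Not an official sample, but for what it is worth, below is a minimal sketch of y = alpha*A*x + beta*y for a small CSR matrix using the legacy hipsparseDcsrmv entry point that other reports on this page also use; newer ROCm releases additionally expose a generic interface (hipsparseSpMV / rocsparse_spmv) that maps more directly onto the cuSPARSE example above. Treat it as an illustration only; error checking is omitted.

#include <hip/hip_runtime_api.h>
#include <hipsparse.h>
#include <iostream>
#include <vector>

int main()
{
    // 3x3 CSR matrix:
    // [1 0 2]
    // [0 3 0]
    // [4 0 5]
    std::vector<int>    h_row_ptr{0, 2, 3, 5};
    std::vector<int>    h_col_ind{0, 2, 1, 0, 2};
    std::vector<double> h_val{1.0, 2.0, 3.0, 4.0, 5.0};
    std::vector<double> h_x{1.0, 1.0, 1.0};
    std::vector<double> h_y(3, 0.0);

    int*    d_row_ptr;
    int*    d_col_ind;
    double* d_val;
    double* d_x;
    double* d_y;
    hipMalloc(&d_row_ptr, sizeof(int) * h_row_ptr.size());
    hipMalloc(&d_col_ind, sizeof(int) * h_col_ind.size());
    hipMalloc(&d_val, sizeof(double) * h_val.size());
    hipMalloc(&d_x, sizeof(double) * h_x.size());
    hipMalloc(&d_y, sizeof(double) * h_y.size());
    hipMemcpy(d_row_ptr, h_row_ptr.data(), sizeof(int) * h_row_ptr.size(), hipMemcpyHostToDevice);
    hipMemcpy(d_col_ind, h_col_ind.data(), sizeof(int) * h_col_ind.size(), hipMemcpyHostToDevice);
    hipMemcpy(d_val, h_val.data(), sizeof(double) * h_val.size(), hipMemcpyHostToDevice);
    hipMemcpy(d_x, h_x.data(), sizeof(double) * h_x.size(), hipMemcpyHostToDevice);
    hipMemcpy(d_y, h_y.data(), sizeof(double) * h_y.size(), hipMemcpyHostToDevice);

    double alpha = 1.0;
    double beta  = 0.0;
    hipsparseHandle_t handle;
    hipsparseCreate(&handle);
    hipsparseSetPointerMode(handle, HIPSPARSE_POINTER_MODE_HOST);
    hipsparseMatDescr_t descr;
    hipsparseCreateMatDescr(&descr);

    // Sparse matrix-vector product: y = alpha * A * x + beta * y
    hipsparseDcsrmv(handle, HIPSPARSE_OPERATION_NON_TRANSPOSE, 3, 3, 5, &alpha,
                    descr, d_val, d_row_ptr, d_col_ind, d_x, &beta, d_y);

    hipMemcpy(h_y.data(), d_y, sizeof(double) * h_y.size(), hipMemcpyDeviceToHost);
    std::cout << h_y[0] << ' ' << h_y[1] << ' ' << h_y[2] << std::endl; // expected: 3 3 9

    hipsparseDestroyMatDescr(descr);
    hipsparseDestroy(handle);
    hipFree(d_row_ptr);
    hipFree(d_col_ind);
    hipFree(d_val);
    hipFree(d_x);
    hipFree(d_y);
    return 0;
}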

Enable gfx1010 by default in rocSPARSE builds

https://github.com/ROCmSoftwarePlatform/rocSPARSE/blob/a1c59eaacdc8c4e31923321bfa3242fb265f6949/CMakeLists.txt#L155

https://github.com/ROCmSoftwarePlatform/rocSPARSE/blob/a1c59eaacdc8c4e31923321bfa3242fb265f6949/CMakeLists.txt#L160

What is the expected behavior

I am aware that gfx1010 is not officially supported, yet I believe that some of the architectures quoted above (i.e. gfx803) are not supported either; additionally, gfx1010 is enabled by default in rocBLAS. I'm building rocSPARSE for gfx1010 successfully on both Arch Linux and Ubuntu and it works, so ideally gfx1010 could be added to the default targets in the same unsupported way. This would hopefully cascade to downstream distros and lower ROCm friction.

What actually happens

gfx1010 is not built by default 🙁

How to reproduce

Standard build.

Environment

Hardware description
GPU AMD Radeon RX 5700 XT
Software version
ROCM v5.6.0

develop branch seems to have a different behavior for choosing general or adaptive csrmv

What is the expected behavior

I'm not sure. Advice required.

What actually happens

  1. I've noticed that traces/profiles contain different kernels when I use the release version of rocSPARSE (ROCm 5.4.2) and the current develop branch.
    In my case (PETSc) hipsparseSpMV is used with CSR matrices without calling hipsparseSpMV_preprocess before.
    rocm-5.4.2 launches csrmvn_adaptive_kernel, develop launches csrmvn_general_kernel.
    Does this mean that the old version preprocesses the matrix implicitly and the new version requires an explicit call to hipsparseSpMV_preprocess (i.e. rocsparse_spmv(..., rocsparse_spmv_stage_preprocess, ...))?

  2. rocsparse-bench -f csrmv also changed its behavior, but it's strange that it is inconsistent when verification is used:
    develop launches csrmvn_general_kernel, with --verify 1 it launches csrmvn_adaptive_kernel.
    rocm-5.4.2 uses csrmvn_adaptive_kernel for both cases.

How to reproduce

./rocsparse-bench -f csrmv --precision s --bench-x --rocalution ${ROCSPARSE_BENCH_DATA_DIR}/*.csr --iters 10 --bench-n 1
rocsparse-bench -f csrmv --precision s --bench-x --rocalution ${ROCSPARSE_BENCH_DATA_DIR}/*.csr --iters 10 --bench-n 1 --verify 1

Environment

Software version
Library current develop e77b2befaf17e8c8dd9b8b64d03237bebf46f9e8 and rocm-5.4.2

bsrmm: Incorrect stride handling for transposed B

What is the expected behavior

[1 1; 1 1] * [1; 1] == [2; 2] regardless of the strides of the vectors

What actually happens

It seems that rocSPARSE uses the number of columns instead of the stride for accessing the input vector.

How to reproduce

#include <hipsparse.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <iomanip>
#include <iostream>

int main() {
  double alpha{1.0};
  double beta{0.0};
  auto host_row_ptrs = thrust::host_vector<int>(2);
  host_row_ptrs[0] = 0;
  host_row_ptrs[1] = 1;
  auto host_values = thrust::host_vector<double>(4, 1);
  auto host_in_vector = thrust::host_vector<double>(3, 1);
  host_in_vector[1] = 1e100;
  thrust::device_vector<int> row_ptrs = host_row_ptrs;
  thrust::device_vector<int> col_idxs(1, 0);
  thrust::device_vector<double> values(4, 1);
  thrust::device_vector<double> in_vector = host_in_vector;
  thrust::device_vector<double> out_vector(2, 12345);
  hipsparseHandle_t handle;
  hipsparseCreate(&handle);
  hipsparseSetPointerMode(handle, HIPSPARSE_POINTER_MODE_HOST);
  hipsparseMatDescr_t descr;
  hipsparseCreateMatDescr(&descr);
  auto status = hipsparseDbsrmm(handle,
                  HIPSPARSE_DIRECTION_COLUMN,
                  HIPSPARSE_OPERATION_NON_TRANSPOSE,
                  HIPSPARSE_OPERATION_TRANSPOSE,
                  1, // mb
                  1, // n
                  1, // kb
                  1, // nnzb
                  &alpha,
                  descr,
                  values.data().get(),
                  row_ptrs.data().get(),
                  col_idxs.data().get(),
                  2, // blockDim
                  in_vector.data().get(), 2,
                  &beta, out_vector.data().get(), 2);
  std::cout << status << '\n';
  hipDeviceSynchronize();
  thrust::host_vector<double> host_out_vector = out_vector;
  std::cout << host_out_vector[0] << ' ' << host_out_vector[1] << std::endl;
  hipsparseDestroyMatDescr(descr);
  hipsparseDestroy(handle);
}

The result should be [2,2], but it is [1+1e100, 1+1e100]

Environment

Hardware description
GPU MI100
Software version
ROCm 5.0.0

Default matrix storage mode

Hi all, this is not an issue so I'm not following the template. I'm wondering if it makes more sense to set the default storage mode of a _rocsparse_mat_descr to unsorted here:

https://github.com/ROCmSoftwarePlatform/rocSPARSE/blob/d8e35fc37ea738c599a6ca7f97ad2fddd275a622/library/src/include/handle.h#L110

I'm bringing this up because we recently hit a bug in hypre during sparse matrix-matrix multiplication that was quite time-consuming to debug. It turned out that the current rocSPARSE implementation of SpGEMM requires the columns of a given row to be sorted, and we didn't have that. Had the default storage mode of a rocSPARSE matrix been "unsorted", we would have figured out the bug much more quickly. Having said that, we could always have called rocsparse_set_mat_storage_method with rocsparse_storage_mode_unsorted for our matrices...
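
For reference, a small sketch of the workaround mentioned above, i.e. flagging a matrix as unsorted before SpGEMM. It assumes the setter is spelled rocsparse_set_mat_storage_mode, as in recent rocsparse.h headers; the exact name should be checked against the header of the installed release.

#include <rocsparse/rocsparse.h> // assumption: ROCm 5.x header layout; older releases use <rocsparse.h>

// Hypothetical fragment (illustration only): tell rocSPARSE that the CSR
// columns of this matrix are not sorted within each row, so routines that
// require sorted input (such as SpGEMM) do not silently assume it.
static rocsparse_status mark_unsorted(rocsparse_mat_descr descr)
{
    // Assumed setter name; check rocsparse.h for the exact symbol.
    return rocsparse_set_mat_storage_mode(descr, rocsparse_storage_mode_unsorted);
}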

Test failures with XNACK enabled

What is the expected behavior

All tests pass with xnack enabled on supported GPUs.

What actually happens

There are test failures with xnack.

Excerpt from test failures
> HSA_XNACK=1 clients/staging/rocsparse-test --gtest_filter="*csrgemm*"
...
[  FAILED  ] 9 tests, listed below:
[  FAILED  ] nightly/csrgemm.extra/f64_r_1_5239_1_2_0_n99_0_NT_NT_0b_0b_0b_0b_csr_bmwcra_1, where GetParam() = { function: "csrgemm", index_type_I: "i32", index_type_J: "i32", compute_type: "f64_r", transA: "NT", transB: "NT", baseA: "0b", baseB: "0b", baseC: "0b", baseD: "0b", M: 1, N: 5239, K: 1, nnz: -1, block_dim: 2, row_block_dimA: 2, col_block_dimA: 2, row_block_dimB: 2, col_block_dimB: 2, dim_x: 1, dim_y: 1, dim_z: 1, ll: -2, l: -1, u: 1, uu: 2, alpha: 2.0, alphai: 0.0, beta: -99.0, betai: 0.0, threshold: 1.0, percentage: 0.0, action: "num", part: "auto", matrix_type: "general", diag: "ND", uplo: "L", storage: "sorted", analysis_policy: "reuse", solve_policy: "auto", direction: "row", order: "col", format: "coo", itilu0_alg: "async_inplace", sddmm_alg: "default", spmv_alg: "default", spsv_alg: "default", spsm_alg: "default", spmm_alg: "default", spgemm_alg: "default", sparse_to_dense_alg: "default", dense_to_sparse_alg: "default", gtsv_interleaved_alg: "default", gpsv_interleaved_alg: "default", matrix: "csr", matrix_init_kind: "default", file: "bmwcra_1", algo: 0, numeric_boost: 0, boost_tol: 0.0, boost_val: 1.0, boost_vali: 0.0, tolm: 1.0, name: "csrgemm_mult_file", category: "nightly", hardware: "all", unit_check: 1, timing: 0, iters: 10, denseld: -1, batch_count: -1, batch_count_A: -1, batch_count_B: -1, batch_count_C: -1, batch_stride: -1 }
[  FAILED  ] nightly/csrgemm_reuse.extra/f64_c_842323_1492312_743434_0_1p5_n99_0_NT_NT_0b_0b_0b_0b_rand, where GetParam() = { function: "csrgemm_reuse", index_type_I: "i32", index_type_J: "i32", compute_type: "f64_c", transA: "NT", transB: "NT", baseA: "0b", baseB: "0b", baseC: "0b", baseD: "0b", M: 842323, N: 1492312, K: 743434, nnz: -1, block_dim: 2, row_block_dimA: 2, col_block_dimA: 2, row_block_dimB: 2, col_block_dimB: 2, dim_x: 1, dim_y: 1, dim_z: 1, ll: -2, l: -1, u: 1, uu: 2, alpha: 0.0, alphai: 1.5, beta: -99.0, betai: 0.0, threshold: 1.0, percentage: 0.0, action: "num", part: "auto", matrix_type: "general", diag: "ND", uplo: "L", storage: "sorted", analysis_policy: "reuse", solve_policy: "auto", direction: "row", order: "col", format: "coo", itilu0_alg: "async_inplace", sddmm_alg: "default", spmv_alg: "default", spsv_alg: "default", spsm_alg: "default", spmm_alg: "default", spgemm_alg: "default", sparse_to_dense_alg: "default", dense_to_sparse_alg: "default", gtsv_interleaved_alg: "default", gpsv_interleaved_alg: "default", matrix: "rand", matrix_init_kind: "default", file: "*", algo: 0, numeric_boost: 0, boost_tol: 0.0, boost_val: 1.0, boost_vali: 0.0, tolm: 1.0, name: "csrgemm_reuse_mult", category: "nightly", hardware: "all", unit_check: 1, timing: 0, iters: 10, denseld: -1, batch_count: -1, batch_count_A: -1, batch_count_B: -1, batch_count_C: -1, batch_stride: -1 }
[  FAILED  ] nightly/csrgemm_reuse.extra/f64_r_1_5239_1_2_0_n99_0_NT_NT_0b_0b_0b_0b_csr_bmwcra_1, where GetParam() = { function: "csrgemm_reuse", index_type_I: "i32", index_type_J: "i32", compute_type: "f64_r", transA: "NT", transB: "NT", baseA: "0b", baseB: "0b", baseC: "0b", baseD: "0b", M: 1, N: 5239, K: 1, nnz: -1, block_dim: 2, row_block_dimA: 2, col_block_dimA: 2, row_block_dimB: 2, col_block_dimB: 2, dim_x: 1, dim_y: 1, dim_z: 1, ll: -2, l: -1, u: 1, uu: 2, alpha: 2.0, alphai: 0.0, beta: -99.0, betai: 0.0, threshold: 1.0, percentage: 0.0, action: "num", part: "auto", matrix_type: "general", diag: "ND", uplo: "L", storage: "sorted", analysis_policy: "reuse", solve_policy: "auto", direction: "row", order: "col", format: "coo", itilu0_alg: "async_inplace", sddmm_alg: "default", spmv_alg: "default", spsv_alg: "default", spsm_alg: "default", spmm_alg: "default", spgemm_alg: "default", sparse_to_dense_alg: "default", dense_to_sparse_alg: "default", gtsv_interleaved_alg: "default", gpsv_interleaved_alg: "default", matrix: "csr", matrix_init_kind: "default", file: "bmwcra_1", algo: 0, numeric_boost: 0, boost_tol: 0.0, boost_val: 1.0, boost_vali: 0.0, tolm: 1.0, name: "csrgemm_reuse_mult_file", category: "nightly", hardware: "all", unit_check: 1, timing: 0, iters: 10, denseld: -1, batch_count: -1, batch_count_A: -1, batch_count_B: -1, batch_count_C: -1, batch_stride: -1 }
[  FAILED  ] nightly/csrgemm_reuse.extra/f32_r_1_1_n1_n99_0_3_0_NT_NT_0b_0b_0b_0b_csr_bibd_22_8, where GetParam() = { function: "csrgemm_reuse", index_type_I: "i32", index_type_J: "i32", compute_type: "f32_r", transA: "NT", transB: "NT", baseA: "0b", baseB: "0b", baseC: "0b", baseD: "0b", M: 1, N: 1, K: -1, nnz: -1, block_dim: 2, row_block_dimA: 2, col_block_dimA: 2, row_block_dimB: 2, col_block_dimB: 2, dim_x: 1, dim_y: 1, dim_z: 1, ll: -2, l: -1, u: 1, uu: 2, alpha: -99.0, alphai: 0.0, beta: 3.0, betai: 0.0, threshold: 1.0, percentage: 0.0, action: "num", part: "auto", matrix_type: "general", diag: "ND", uplo: "L", storage: "sorted", analysis_policy: "reuse", solve_policy: "auto", direction: "row", order: "col", format: "coo", itilu0_alg: "async_inplace", sddmm_alg: "default", spmv_alg: "default", spsv_alg: "default", spsm_alg: "default", spmm_alg: "default", spgemm_alg: "default", sparse_to_dense_alg: "default", dense_to_sparse_alg: "default", gtsv_interleaved_alg: "default", gpsv_interleaved_alg: "default", matrix: "csr", matrix_init_kind: "default", file: "bibd_22_8", algo: 0, numeric_boost: 0, boost_tol: 0.0, boost_val: 1.0, boost_vali: 0.0, tolm: 1.0, name: "csrgemm_reuse_scale_file", category: "nightly", hardware: "all", unit_check: 1, timing: 0, iters: 10, denseld: -1, batch_count: -1, batch_count_A: -1, batch_count_B: -1, batch_count_C: -1, batch_stride: -1 }
[  FAILED  ] nightly/csrgemm_reuse.extra/f64_r_1_1_n1_n99_0_3_0_NT_NT_0b_0b_0b_0b_csr_Chebyshev4, where GetParam() = { function: "csrgemm_reuse", index_type_I: "i32", index_type_J: "i32", compute_type: "f64_r", transA: "NT", transB: "NT", baseA: "0b", baseB: "0b", baseC: "0b", baseD: "0b", M: 1, N: 1, K: -1, nnz: -1, block_dim: 2, row_block_dimA: 2, col_block_dimA: 2, row_block_dimB: 2, col_block_dimB: 2, dim_x: 1, dim_y: 1, dim_z: 1, ll: -2, l: -1, u: 1, uu: 2, alpha: -99.0, alphai: 0.0, beta: 3.0, betai: 0.0, threshold: 1.0, percentage: 0.0, action: "num", part: "auto", matrix_type: "general", diag: "ND", uplo: "L", storage: "sorted", analysis_policy: "reuse", solve_policy: "auto", direction: "row", order: "col", format: "coo", itilu0_alg: "async_inplace", sddmm_alg: "default", spmv_alg: "default", spsv_alg: "default", spsm_alg: "default", spmm_alg: "default", spgemm_alg: "default", sparse_to_dense_alg: "default", dense_to_sparse_alg: "default", gtsv_interleaved_alg: "default", gpsv_interleaved_alg: "default", matrix: "csr", matrix_init_kind: "default", file: "Chebyshev4", algo: 0, numeric_boost: 0, boost_tol: 0.0, boost_val: 1.0, boost_vali: 0.0, tolm: 1.0, name: "csrgemm_reuse_scale_file", category: "nightly", hardware: "all", unit_check: 1, timing: 0, iters: 10, denseld: -1, batch_count: -1, batch_count_A: -1, batch_count_B: -1, batch_count_C: -1, batch_stride: -1 }
[  FAILED  ] nightly/csrgemm_reuse.extra/f64_r_1_1_n1_n99_0_n0p5_0_NT_NT_0b_0b_0b_0b_csr_Chebyshev4, where GetParam() = { function: "csrgemm_reuse", index_type_I: "i32", index_type_J: "i32", compute_type: "f64_r", transA: "NT", transB: "NT", baseA: "0b", baseB: "0b", baseC: "0b", baseD: "0b", M: 1, N: 1, K: -1, nnz: -1, block_dim: 2, row_block_dimA: 2, col_block_dimA: 2, row_block_dimB: 2, col_block_dimB: 2, dim_x: 1, dim_y: 1, dim_z: 1, ll: -2, l: -1, u: 1, uu: 2, alpha: -99.0, alphai: 0.0, beta: -0.5, betai: 0.0, threshold: 1.0, percentage: 0.0, action: "num", part: "auto", matrix_type: "general", diag: "ND", uplo: "L", storage: "sorted", analysis_policy: "reuse", solve_policy: "auto", direction: "row", order: "col", format: "coo", itilu0_alg: "async_inplace", sddmm_alg: "default", spmv_alg: "default", spsv_alg: "default", spsm_alg: "default", spmm_alg: "default", spgemm_alg: "default", sparse_to_dense_alg: "default", dense_to_sparse_alg: "default", gtsv_interleaved_alg: "default", gpsv_interleaved_alg: "default", matrix: "csr", matrix_init_kind: "default", file: "Chebyshev4", algo: 0, numeric_boost: 0, boost_tol: 0.0, boost_val: 1.0, boost_vali: 0.0, tolm: 1.0, name: "csrgemm_reuse_scale_file", category: "nightly", hardware: "all", unit_check: 1, timing: 0, iters: 10, denseld: -1, batch_count: -1, batch_count_A: -1, batch_count_B: -1, batch_count_C: -1, batch_stride: -1 }
[  FAILED  ] nightly/csrgemm_reuse.extra/f32_c_1_1_n1_n99_0_0_1p5_NT_NT_0b_0b_0b_0b_csr_Chevron4, where GetParam() = { function: "csrgemm_reuse", index_type_I: "i32", index_type_J: "i32", compute_type: "f32_c", transA: "NT", transB: "NT", baseA: "0b", baseB: "0b", baseC: "0b", baseD: "0b", M: 1, N: 1, K: -1, nnz: -1, block_dim: 2, row_block_dimA: 2, col_block_dimA: 2, row_block_dimB: 2, col_block_dimB: 2, dim_x: 1, dim_y: 1, dim_z: 1, ll: -2, l: -1, u: 1, uu: 2, alpha: -99.0, alphai: 0.0, beta: 0.0, betai: 1.5, threshold: 1.0, percentage: 0.0, action: "num", part: "auto", matrix_type: "general", diag: "ND", uplo: "L", storage: "sorted", analysis_policy: "reuse", solve_policy: "auto", direction: "row", order: "col", format: "coo", itilu0_alg: "async_inplace", sddmm_alg: "default", spmv_alg: "default", spsv_alg: "default", spsm_alg: "default", spmm_alg: "default", spgemm_alg: "default", sparse_to_dense_alg: "default", dense_to_sparse_alg: "default", gtsv_interleaved_alg: "default", gpsv_interleaved_alg: "default", matrix: "csr", matrix_init_kind: "default", file: "Chevron4", algo: 0, numeric_boost: 0, boost_tol: 0.0, boost_val: 1.0, boost_vali: 0.0, tolm: 1.0, name: "csrgemm_reuse_scale_file", category: "nightly", hardware: "all", unit_check: 1, timing: 0, iters: 10, denseld: -1, batch_count: -1, batch_count_A: -1, batch_count_B: -1, batch_count_C: -1, batch_stride: -1 }
[  FAILED  ] nightly/csrgemm_reuse.extra/f32_c_1_1_n1_n99_0_3_1p5_NT_NT_0b_0b_0b_0b_csr_Chevron4, where GetParam() = { function: "csrgemm_reuse", index_type_I: "i32", index_type_J: "i32", compute_type: "f32_c", transA: "NT", transB: "NT", baseA: "0b", baseB: "0b", baseC: "0b", baseD: "0b", M: 1, N: 1, K: -1, nnz: -1, block_dim: 2, row_block_dimA: 2, col_block_dimA: 2, row_block_dimB: 2, col_block_dimB: 2, dim_x: 1, dim_y: 1, dim_z: 1, ll: -2, l: -1, u: 1, uu: 2, alpha: -99.0, alphai: 0.0, beta: 3.0, betai: 1.5, threshold: 1.0, percentage: 0.0, action: "num", part: "auto", matrix_type: "general", diag: "ND", uplo: "L", storage: "sorted", analysis_policy: "reuse", solve_policy: "auto", direction: "row", order: "col", format: "coo", itilu0_alg: "async_inplace", sddmm_alg: "default", spmv_alg: "default", spsv_alg: "default", spsm_alg: "default", spmm_alg: "default", spgemm_alg: "default", sparse_to_dense_alg: "default", dense_to_sparse_alg: "default", gtsv_interleaved_alg: "default", gpsv_interleaved_alg: "default", matrix: "csr", matrix_init_kind: "default", file: "Chevron4", algo: 0, numeric_boost: 0, boost_tol: 0.0, boost_val: 1.0, boost_vali: 0.0, tolm: 1.0, name: "csrgemm_reuse_scale_file", category: "nightly", hardware: "all", unit_check: 1, timing: 0, iters: 10, denseld: -1, batch_count: -1, batch_count_A: -1, batch_count_B: -1, batch_count_C: -1, batch_stride: -1 }
[  FAILED  ] nightly/csrgemm_reuse.extra/f32_c_1_1_n1_n99_0_n0p5_1_NT_NT_0b_0b_0b_0b_csr_Chevron4, where GetParam() = { function: "csrgemm_reuse", index_type_I: "i32", index_type_J: "i32", compute_type: "f32_c", transA: "NT", transB: "NT", baseA: "0b", baseB: "0b", baseC: "0b", baseD: "0b", M: 1, N: 1, K: -1, nnz: -1, block_dim: 2, row_block_dimA: 2, col_block_dimA: 2, row_block_dimB: 2, col_block_dimB: 2, dim_x: 1, dim_y: 1, dim_z: 1, ll: -2, l: -1, u: 1, uu: 2, alpha: -99.0, alphai: 0.0, beta: -0.5, betai: 1.0, threshold: 1.0, percentage: 0.0, action: "num", part: "auto", matrix_type: "general", diag: "ND", uplo: "L", storage: "sorted", analysis_policy: "reuse", solve_policy: "auto", direction: "row", order: "col", format: "coo", itilu0_alg: "async_inplace", sddmm_alg: "default", spmv_alg: "default", spsv_alg: "default", spsm_alg: "default", spmm_alg: "default", spgemm_alg: "default", sparse_to_dense_alg: "default", dense_to_sparse_alg: "default", gtsv_interleaved_alg: "default", gpsv_interleaved_alg: "default", matrix: "csr", matrix_init_kind: "default", file: "Chevron4", algo: 0, numeric_boost: 0, boost_tol: 0.0, boost_val: 1.0, boost_vali: 0.0, tolm: 1.0, name: "csrgemm_reuse_scale_file", category: "nightly", hardware: "all", unit_check: 1, timing: 0, iters: 10, denseld: -1, batch_count: -1, batch_count_A: -1, batch_count_B: -1, batch_count_C: -1, batch_stride: -1 }

How to reproduce

  • Build with an xnack GPU target enabled (the default setting is sufficient as it contains xnack for the MI200).
  • Run the test suite with xnack enabled (HSA_XNACK=1)
HSA_XNACK=1 clients/staging/rocsparse-test

Environment

Hardware description
GPU MI210 (x4)
CPU AMD EPYC 7763 64-Core
Software version
ROCK AMDGPU Driver version: 5.18.2.22.40
Linux Kernel: Linux ixt-sjc2-124 5.15.0-52-generic #58~20.04.1-Ubuntu SMP Thu Oct 13 13:09:46 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
ROCR ROCm 5.3.0
HCC HIP version: 5.3.22061-e8e78f1a
AMD clang version 15.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.3.0 22362 3cf23f77f8208174a2ee7c616f4be23674d7b081)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-5.3.0/llvm/bin
Library develop at time of writing (Wed 09 Nov 2022) 3a05469

2 failed tests from GoogleTestVerification.UninstantiatedParameterizedTestSuite<csrilusv>

What is the expected behavior

  • rocsparse-test pass all the tests

What actually happens

  • 2 failed tests from GoogleTestVerification.UninstantiatedParameterizedTestSuite, detailed info:
[----------] 2 tests from GoogleTestVerification
[ RUN      ] GoogleTestVerification.UninstantiatedParameterizedTestSuite<csricsv>
/tmp/portage/sci-libs/rocSPARSE-4.3.0/work/rocSPARSE-rocm-4.3.0/clients/tests/test_csricsv.cpp:89: Failure
Parameterized test suite csricsv is defined via TEST_P, but never instantiated. None of the test cases will run. Either no INSTANTIATE_TEST_SUITE_P is provided or the only ones provided expand to nothing.

Ideally, TEST_P definitions should only ever be included as part of binaries that intend to use them. (As opposed to, for example, being placed in a library that may be linked in to get other utilities.)

To suppress this error for this test suite, insert the following line (in a non-header) in the namespace it is defined in:

GTEST_ALLOW_UNINSTANTIATED_PARAMETERIZED_TEST(csricsv);
[  FAILED  ] GoogleTestVerification.UninstantiatedParameterizedTestSuite<csricsv> (0 ms)
[ RUN      ] GoogleTestVerification.UninstantiatedParameterizedTestSuite<csrilusv>
/tmp/portage/sci-libs/rocSPARSE-4.3.0/work/rocSPARSE-rocm-4.3.0/clients/tests/test_csrilusv.cpp:89: Failure
Parameterized test suite csrilusv is defined via TEST_P, but never instantiated. None of the test cases will run. Either no INSTANTIATE_TEST_SUITE_P is provided or the only ones provided expand to nothing.

Ideally, TEST_P definitions should only ever be included as part of binaries that intend to use them. (As opposed to, for example, being placed in a library that may be linked in to get other utilities.)

To suppress this error for this test suite, insert the following line (in a non-header) in the namespace it is defined in:

GTEST_ALLOW_UNINSTANTIATED_PARAMETERIZED_TEST(csrilusv);
[  FAILED  ] GoogleTestVerification.UninstantiatedParameterizedTestSuite<csrilusv> (0 ms)

How to reproduce

  • Compile rocSPARSE 4.3.0 with tests enabled, and run ./rocsparse-test

Environment

Hardware description

  • GPU: Vega 20 [Radeon VII]
  • CPU: AMD Ryzen 7 5800X

Software version

  • Linux: 5.13.3
  • ROCK: upstream kernel
  • ROCR: v4.3.0
  • HIPCC: v4.3.0
  • Library: v4.3.0

[Issue or Feature Request] Extract the strictly lower part and the diagonal, invert the diagonal, add/subtract matrices, deal with fill-in?

Does rocSPARSE support the following operations? I can't find them; I hope I have simply been looking in the wrong spot, but ChatGPT couldn't find these operations either.

Given a sparse SPD matrix A stored on the GPU in CSR format using rocThrust vectors, there should be no host-device transfers: the application runs entirely on the GPU once the initial data has been transferred. This is part of the development of an AINV preconditioner.

1: compute the matrix K1 = I - L*D⁻¹, where I is the identity matrix, L is the strictly lower part of A, and D⁻¹ is the inverse of the diagonal of A

2: compute the matrix K2 = I - D⁻¹*L^T, where I is the identity matrix, L^T is the transpose of the strictly lower part of A, and D⁻¹ is the inverse of the diagonal of A

3: compute the matrix P = K1*K2, which can be done with rocsparse_dcsrmm(). How about 1 and 2, though: extracting the strictly lower part and the diagonal, inverting the extracted diagonal, adding and subtracting matrices, and dealing with fill-in (with threshold support)? See the sketch below.

I am using ROCm 5.7.1 on (RHEL) AlmaLinux 9.3. I tested rocALUTION for the application, which uses all of these operations, but due to its host-device approach rocALUTION is not suitable: the application is about on-device computation over thousands of timesteps.
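
For steps 1 and 2, one building block is extracting the inverse diagonal of A into a dense device vector. Below is a minimal, hedged sketch of that step in plain HIP, assuming 0-based CSR indexing and that A is SPD (so every diagonal entry exists and is nonzero); the kernel name extract_inverse_diagonal and the toy 3x3 matrix are made up for illustration and are not part of the rocSPARSE API. The strictly lower part of A can be gathered with a similar second pass over the rows, scaling each kept entry by the corresponding inverse diagonal value; fill-in only enters the picture later, when forming the product K1*K2.

#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

// One thread per row scans the row's nonzeros for the diagonal entry and
// stores its reciprocal. Assumes 0-based CSR indexing and an SPD matrix.
__global__ void extract_inverse_diagonal(int m,
                                         const int* __restrict__ csr_row_ptr,
                                         const int* __restrict__ csr_col_ind,
                                         const double* __restrict__ csr_val,
                                         double* __restrict__ d_inv)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if(row >= m)
        return;

    for(int k = csr_row_ptr[row]; k < csr_row_ptr[row + 1]; ++k)
    {
        if(csr_col_ind[k] == row)
        {
            d_inv[row] = 1.0 / csr_val[k];
            return;
        }
    }
}

int main()
{
    // Toy SPD matrix A = [[4, 1, 0], [1, 3, 0], [0, 0, 2]] in CSR (0-based).
    const int                 m       = 3;
    const std::vector<int>    row_ptr = {0, 2, 4, 5};
    const std::vector<int>    col_ind = {0, 1, 0, 1, 2};
    const std::vector<double> val     = {4.0, 1.0, 1.0, 3.0, 2.0};

    int*    d_row_ptr;
    int*    d_col_ind;
    double* d_val;
    double* d_dinv;
    hipMalloc((void**)&d_row_ptr, sizeof(int) * row_ptr.size());
    hipMalloc((void**)&d_col_ind, sizeof(int) * col_ind.size());
    hipMalloc((void**)&d_val, sizeof(double) * val.size());
    hipMalloc((void**)&d_dinv, sizeof(double) * m);
    hipMemcpy(d_row_ptr, row_ptr.data(), sizeof(int) * row_ptr.size(), hipMemcpyHostToDevice);
    hipMemcpy(d_col_ind, col_ind.data(), sizeof(int) * col_ind.size(), hipMemcpyHostToDevice);
    hipMemcpy(d_val, val.data(), sizeof(double) * val.size(), hipMemcpyHostToDevice);

    hipLaunchKernelGGL(extract_inverse_diagonal,
                       dim3((m + 255) / 256),
                       dim3(256),
                       0,
                       0,
                       m,
                       d_row_ptr,
                       d_col_ind,
                       d_val,
                       d_dinv);
    hipDeviceSynchronize();

    std::vector<double> dinv(m);
    hipMemcpy(dinv.data(), d_dinv, sizeof(double) * m, hipMemcpyDeviceToHost);
    for(int i = 0; i < m; ++i)
        std::printf("D^-1[%d] = %f\n", i, dinv[i]); // expect 0.25, 0.333333, 0.5

    hipFree(d_row_ptr);
    hipFree(d_col_ind);
    hipFree(d_val);
    hipFree(d_dinv);
    return 0;
}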

Illegal Instruction Detected during compilation

What is the expected behavior

Ideally, compilation should finish successfully; alternatively, if a requested GPU target is invalid, the CMake configuration step should warn about it and remove the invalid targets.

What actually happens

CMake uses the AMDGPU_TARGETS variable without checking whether its entries are valid. The log below was produced with AMDGPU_TARGETS=gfx000;gfx1010, matching the output of rocm_agent_enumerator on my computer.

cmake && make
-- Setting build type to 'Release' as none was specified.
-- The CXX compiler identification is Clang 14.0.0
-- The Fortran compiler identification is GNU 11.2.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /opt/rocm/bin/hipcc - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Detecting Fortran compiler ABI info
-- Detecting Fortran compiler ABI info - done
-- Check for working Fortran compiler: /usr/bin/f95 - skipped
-- Using hip-clang to build for amdgpu backend
-- Found Git: /usr/bin/git (found version "2.35.3") 
-- The C compiler identification is GNU 11.2.0
-- The CXX compiler identification is Clang 14.0.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /opt/rocm/bin/hipcc - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- rocm-cmake: Set license file to /build/rocsparse/src/extern/rocm-cmake-master/LICENSE.
-- Configuring done
-- Generating done
-- Build files have been written to: /build/rocsparse/src/extern/rocm-cmake-master
-- GLOB mismatch!
-- rocm-cmake: Set license file to /build/rocsparse/src/extern/rocm-cmake-master/LICENSE.
-- Configuring done
-- Generating done
-- Build files have been written to: /build/rocsparse/src/extern/rocm-cmake-master
Install the project...
-- Install configuration: ""
-- Installing: /build/rocsparse/src/extern/rocm-cmake/./share
-- Installing: /build/rocsparse/src/extern/rocm-cmake/./share/rocm
-- Installing: /build/rocsparse/src/extern/rocm-cmake/./share/rocm/cmake
-- Installing: /build/rocsparse/src/extern/rocm-cmake/./share/rocm/cmake/ROCMUtilities.cmake
-- Installing: /build/rocsparse/src/extern/rocm-cmake/./share/rocm/cmake/ROCMDoxygenDoc.cmake
-- Installing: /build/rocsparse/src/extern/rocm-cmake/./share/rocm/cmake/ROCMInstallSymlinks.cmake
-- Installing: /build/rocsparse/src/extern/rocm-cmake/./share/rocm/cmake/ROCMConfig.cmake
-- Installing: /build/rocsparse/src/extern/rocm-cmake/./share/rocm/cmake/ROCMHeaderWrapper.cmake
-- Installing: /build/rocsparse/src/extern/rocm-cmake/./share/rocm/cmake/header_template.h.in
-- Installing: /build/rocsparse/src/extern/rocm-cmake/./share/rocm/cmake/ROCMSphinxDoc.cmake
-- Installing: /build/rocsparse/src/extern/rocm-cmake/./share/rocm/cmake/ROCMAnalyzers.cmake
-- Installing: /build/rocsparse/src/extern/rocm-cmake/./share/rocm/cmake/ROCMClients.cmake
-- Installing: /build/rocsparse/src/extern/rocm-cmake/./share/rocm/cmake/ROCMCreatePackage.cmake
-- Installing: /build/rocsparse/src/extern/rocm-cmake/./share/rocm/cmake/ROCMCheckTargetIds.cmake
-- Installing: /build/rocsparse/src/extern/rocm-cmake/./share/rocm/cmake/ROCMChecks.cmake
-- Installing: /build/rocsparse/src/extern/rocm-cmake/./share/rocm/cmake/ROCMInstallTargets.cmake
-- Installing: /build/rocsparse/src/extern/rocm-cmake/./share/rocm/cmake/ROCMPackageConfigHelpers.cmake
-- Installing: /build/rocsparse/src/extern/rocm-cmake/./share/rocm/cmake/ROCMCppCheck.cmake
-- Installing: /build/rocsparse/src/extern/rocm-cmake/./share/rocm/cmake/ROCMClangTidy.cmake
-- Installing: /build/rocsparse/src/extern/rocm-cmake/./share/rocm/cmake/ROCMSetupVersion.cmake
-- Installing: /build/rocsparse/src/extern/rocm-cmake/./share/rocm/cmake/ROCMDocs.cmake
-- Installing: /build/rocsparse/src/extern/rocm-cmake/share/rocm/cmake/ROCMConfigVersion.cmake
-- Installing: /build/rocsparse/src/extern/rocm-cmake/share/doc/rocm-cmake/LICENSE
-- Performing Test COMPILER_HAS_TARGET_ID_gfx803
-- Performing Test COMPILER_HAS_TARGET_ID_gfx803 - Success
-- Performing Test COMPILER_HAS_TARGET_ID_gfx900_xnack_off
-- Performing Test COMPILER_HAS_TARGET_ID_gfx900_xnack_off - Success
-- Performing Test COMPILER_HAS_TARGET_ID_gfx906_xnack_off
-- Performing Test COMPILER_HAS_TARGET_ID_gfx906_xnack_off - Success
-- Performing Test COMPILER_HAS_TARGET_ID_gfx908_xnack_off
-- Performing Test COMPILER_HAS_TARGET_ID_gfx908_xnack_off - Success
-- Performing Test COMPILER_HAS_TARGET_ID_gfx90a_xnack_off
-- Performing Test COMPILER_HAS_TARGET_ID_gfx90a_xnack_off - Success
-- Performing Test COMPILER_HAS_TARGET_ID_gfx90a_xnack_on
-- Performing Test COMPILER_HAS_TARGET_ID_gfx90a_xnack_on - Success
-- Performing Test COMPILER_HAS_TARGET_ID_gfx1030
-- Performing Test COMPILER_HAS_TARGET_ID_gfx1030 - Success
-- AMDGPU_TARGETS: gfx000;gfx1010
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE  
-- hip::amdhip64 is SHARED_LIBRARY
-- Performing Test HIP_CLANG_SUPPORTS_PARALLEL_JOBS
-- Performing Test HIP_CLANG_SUPPORTS_PARALLEL_JOBS - Success
-- hip::amdhip64 is SHARED_LIBRARY
-- Performing Test COMPILER_HAS_HIDDEN_VISIBILITY
-- Performing Test COMPILER_HAS_HIDDEN_VISIBILITY - Success
-- Performing Test COMPILER_HAS_HIDDEN_INLINE_VISIBILITY
-- Performing Test COMPILER_HAS_HIDDEN_INLINE_VISIBILITY - Success
-- Performing Test COMPILER_HAS_DEPRECATED_ATTR
-- Performing Test COMPILER_HAS_DEPRECATED_ATTR - Success
CMake Warning at /build/rocsparse/src/extern/rocm-cmake/share/rocm/cmake/ROCMUtilities.cmake:50 (message):
  Could not determine the version of program rpmbuild.
Call Stack (most recent call first):
  /build/rocsparse/src/extern/rocm-cmake/share/rocm/cmake/ROCMCreatePackage.cmake:283 (rocm_find_program_version)
  library/CMakeLists.txt:155 (rocm_create_package)


-- Configuring done
-- Generating done
-- Build files have been written to: /build/rocsparse/src
Scanning dependencies of target rocsparse_fortran
[  1%] Building Fortran object library/CMakeFiles/rocsparse_fortran.dir/src/rocsparse_enums.f90.o
[  2%] Building Fortran object library/CMakeFiles/rocsparse_fortran.dir/src/rocsparse.f90.o
[  2%] Built target rocsparse_fortran
[  2%] Building CXX object library/CMakeFiles/rocsparse.dir/src/handle.cpp.o
[  3%] Building CXX object library/CMakeFiles/rocsparse.dir/src/status.cpp.o
[  4%] Building CXX object library/CMakeFiles/rocsparse.dir/src/rocsparse_auxiliary.cpp.o
[  4%] Building CXX object library/CMakeFiles/rocsparse.dir/src/level1/rocsparse_axpyi.cpp.o
[  5%] Building CXX object library/CMakeFiles/rocsparse.dir/src/level1/rocsparse_doti.cpp.o
[  6%] Building CXX object library/CMakeFiles/rocsparse.dir/src/level1/rocsparse_dotci.cpp.o
[  7%] Building CXX object library/CMakeFiles/rocsparse.dir/src/level1/rocsparse_gthr.cpp.o
[  7%] Building CXX object library/CMakeFiles/rocsparse.dir/src/level1/rocsparse_gthrz.cpp.o
[  8%] Building CXX object library/CMakeFiles/rocsparse.dir/src/level1/rocsparse_roti.cpp.o
[  9%] Building CXX object library/CMakeFiles/rocsparse.dir/src/level1/rocsparse_sctr.cpp.o
[ 10%] Building CXX object library/CMakeFiles/rocsparse.dir/src/level1/rocsparse_axpby.cpp.o
[ 10%] Building CXX object library/CMakeFiles/rocsparse.dir/src/level1/rocsparse_gather.cpp.o
[ 11%] Building CXX object library/CMakeFiles/rocsparse.dir/src/level1/rocsparse_scatter.cpp.o
[ 12%] Building CXX object library/CMakeFiles/rocsparse.dir/src/level1/rocsparse_rot.cpp.o
[ 13%] Building CXX object library/CMakeFiles/rocsparse.dir/src/level1/rocsparse_spvv.cpp.o
[ 13%] Building CXX object library/CMakeFiles/rocsparse.dir/src/level2/rocsparse_bsrmv.cpp.o
[ 14%] Building CXX object library/CMakeFiles/rocsparse.dir/src/level2/rocsparse_bsrxmv.cpp.o
[ 15%] Building CXX object library/CMakeFiles/rocsparse.dir/src/level2/rocsparse_bsrxmv_spzl_2x2.cpp.o
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr4 = V_MOV_B32_dpp undef $vgpr4(tied-def 0), $vgpr2, 322, 10, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr5 = V_MOV_B32_dpp undef $vgpr5(tied-def 0), $vgpr3, 322, 10, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr4 = V_MOV_B32_dpp undef $vgpr4(tied-def 0), $vgpr2, 322, 10, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr5 = V_MOV_B32_dpp undef $vgpr5(tied-def 0), $vgpr3, 322, 10, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr4 = V_MOV_B32_dpp undef $vgpr4(tied-def 0), $vgpr2, 323, 12, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr5 = V_MOV_B32_dpp undef $vgpr5(tied-def 0), $vgpr3, 323, 12, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr4 = V_MOV_B32_dpp undef $vgpr4(tied-def 0), $vgpr2, 322, 10, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr5 = V_MOV_B32_dpp undef $vgpr5(tied-def 0), $vgpr3, 322, 10, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr4 = V_MOV_B32_dpp undef $vgpr4(tied-def 0), $vgpr2, 322, 10, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr5 = V_MOV_B32_dpp undef $vgpr5(tied-def 0), $vgpr3, 322, 10, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr4 = V_MOV_B32_dpp undef $vgpr4(tied-def 0), $vgpr2, 323, 12, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr5 = V_MOV_B32_dpp undef $vgpr5(tied-def 0), $vgpr3, 323, 12, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr4 = V_MOV_B32_dpp undef $vgpr4(tied-def 0), $vgpr2, 322, 10, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr5 = V_MOV_B32_dpp undef $vgpr5(tied-def 0), $vgpr3, 322, 10, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr8 = V_MOV_B32_dpp undef $vgpr8(tied-def 0), $vgpr6, 322, 10, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr9 = V_MOV_B32_dpp undef $vgpr9(tied-def 0), $vgpr7, 322, 10, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr2 = V_MOV_B32_dpp undef $vgpr2(tied-def 0), $vgpr4, 322, 10, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr3 = V_MOV_B32_dpp undef $vgpr3(tied-def 0), $vgpr5, 322, 10, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr8 = V_MOV_B32_dpp undef $vgpr8(tied-def 0), $vgpr6, 322, 10, 15, 0, implicit $exec
fatal error: too many errors emitted, stopping now [-ferror-limit=]
renamable $vgpr9 = V_MOV_B32_dpp undef $vgpr9(tied-def 0), $vgpr7, 322, 10, 15, 0, implicit $exec
renamable $vgpr4 = V_MOV_B32_dpp undef $vgpr4(tied-def 0), $vgpr2, 323, 12, 15, 0, implicit $exec
renamable $vgpr5 = V_MOV_B32_dpp undef $vgpr5(tied-def 0), $vgpr3, 323, 12, 15, 0, implicit $exec
renamable $vgpr8 = V_MOV_B32_dpp undef $vgpr8(tied-def 0), $vgpr6, 323, 12, 15, 0, implicit $exec
renamable $vgpr9 = V_MOV_B32_dpp undef $vgpr9(tied-def 0), $vgpr7, 323, 12, 15, 0, implicit $exec
renamable $vgpr4 = V_MOV_B32_dpp undef $vgpr4(tied-def 0), $vgpr2, 322, 10, 15, 0, implicit $exec
renamable $vgpr5 = V_MOV_B32_dpp undef $vgpr5(tied-def 0), $vgpr3, 322, 10, 15, 0, implicit $exec
renamable $vgpr8 = V_MOV_B32_dpp undef $vgpr8(tied-def 0), $vgpr6, 322, 10, 15, 0, implicit $exec
renamable $vgpr9 = V_MOV_B32_dpp undef $vgpr9(tied-def 0), $vgpr7, 322, 10, 15, 0, implicit $exec
renamable $vgpr2 = V_MOV_B32_dpp undef $vgpr2(tied-def 0), $vgpr4, 322, 10, 15, 0, implicit $exec
renamable $vgpr3 = V_MOV_B32_dpp undef $vgpr3(tied-def 0), $vgpr5, 322, 10, 15, 0, implicit $exec
renamable $vgpr8 = V_MOV_B32_dpp undef $vgpr8(tied-def 0), $vgpr6, 322, 10, 15, 0, implicit $exec
renamable $vgpr9 = V_MOV_B32_dpp undef $vgpr9(tied-def 0), $vgpr7, 322, 10, 15, 0, implicit $exec
renamable $vgpr4 = V_MOV_B32_dpp undef $vgpr4(tied-def 0), $vgpr2, 323, 12, 15, 0, implicit $exec
renamable $vgpr5 = V_MOV_B32_dpp undef $vgpr5(tied-def 0), $vgpr3, 323, 12, 15, 0, implicit $exec
renamable $vgpr8 = V_MOV_B32_dpp undef $vgpr8(tied-def 0), $vgpr6, 323, 12, 15, 0, implicit $exec
renamable $vgpr9 = V_MOV_B32_dpp undef $vgpr9(tied-def 0), $vgpr7, 323, 12, 15, 0, implicit $exec
renamable $vgpr9 = V_MOV_B32_dpp undef $vgpr9(tied-def 0), $vgpr4, 322, 10, 15, 0, implicit $exec
renamable $vgpr6 = V_MOV_B32_dpp undef $vgpr6(tied-def 0), $vgpr2, 322, 10, 15, 0, implicit $exec
renamable $vgpr7 = V_MOV_B32_dpp undef $vgpr7(tied-def 0), $vgpr3, 322, 10, 15, 0, implicit $exec
renamable $vgpr8 = V_MOV_B32_dpp undef $vgpr8(tied-def 0), $vgpr5, 322, 10, 15, 0, implicit $exec
renamable $vgpr5 = V_MOV_B32_dpp undef $vgpr5(tied-def 0), $vgpr2, 322, 10, 15, 0, implicit $exec
renamable $vgpr7 = V_MOV_B32_dpp undef $vgpr7(tied-def 0), $vgpr3, 322, 10, 15, 0, implicit $exec
renamable $vgpr8 = V_MOV_B32_dpp undef $vgpr8(tied-def 0), $vgpr4, 322, 10, 15, 0, implicit $exec
renamable $vgpr9 = V_MOV_B32_dpp undef $vgpr9(tied-def 0), $vgpr6, 322, 10, 15, 0, implicit $exec
renamable $vgpr9 = V_MOV_B32_dpp undef $vgpr9(tied-def 0), $vgpr2, 323, 12, 15, 0, implicit $exec
renamable $vgpr4 = V_MOV_B32_dpp undef $vgpr4(tied-def 0), $vgpr3, 323, 12, 15, 0, implicit $exec
renamable $vgpr7 = V_MOV_B32_dpp undef $vgpr7(tied-def 0), $vgpr5, 323, 12, 15, 0, implicit $exec
renamable $vgpr8 = V_MOV_B32_dpp undef $vgpr8(tied-def 0), $vgpr6, 323, 12, 15, 0, implicit $exec
renamable $vgpr9 = V_MOV_B32_dpp undef $vgpr9(tied-def 0), $vgpr4, 322, 10, 15, 0, implicit $exec
renamable $vgpr6 = V_MOV_B32_dpp undef $vgpr6(tied-def 0), $vgpr2, 322, 10, 15, 0, implicit $exec
renamable $vgpr7 = V_MOV_B32_dpp undef $vgpr7(tied-def 0), $vgpr3, 322, 10, 15, 0, implicit $exec
renamable $vgpr8 = V_MOV_B32_dpp undef $vgpr8(tied-def 0), $vgpr5, 322, 10, 15, 0, implicit $exec
renamable $vgpr5 = V_MOV_B32_dpp undef $vgpr5(tied-def 0), $vgpr2, 322, 10, 15, 0, implicit $exec
renamable $vgpr7 = V_MOV_B32_dpp undef $vgpr7(tied-def 0), $vgpr3, 322, 10, 15, 0, implicit $exec
renamable $vgpr8 = V_MOV_B32_dpp undef $vgpr8(tied-def 0), $vgpr4, 322, 10, 15, 0, implicit $exec
renamable $vgpr9 = V_MOV_B32_dpp undef $vgpr9(tied-def 0), $vgpr6, 322, 10, 15, 0, implicit $exec
renamable $vgpr9 = V_MOV_B32_dpp undef $vgpr9(tied-def 0), $vgpr2, 323, 12, 15, 0, implicit $exec
renamable $vgpr4 = V_MOV_B32_dpp undef $vgpr4(tied-def 0), $vgpr3, 323, 12, 15, 0, implicit $exec
renamable $vgpr7 = V_MOV_B32_dpp undef $vgpr7(tied-def 0), $vgpr5, 323, 12, 15, 0, implicit $exec
renamable $vgpr8 = V_MOV_B32_dpp undef $vgpr8(tied-def 0), $vgpr6, 323, 12, 15, 0, implicit $exec
renamable $vgpr8 = V_MOV_B32_dpp undef $vgpr8(tied-def 0), $vgpr2, 322, 10, 15, 0, implicit $exec
renamable $vgpr9 = V_MOV_B32_dpp undef $vgpr9(tied-def 0), $vgpr3, 322, 10, 15, 0, implicit $exec
renamable $vgpr12 = V_MOV_B32_dpp undef $vgpr12(tied-def 0), $vgpr4, 322, 10, 15, 0, implicit $exec
renamable $vgpr13 = V_MOV_B32_dpp undef $vgpr13(tied-def 0), $vgpr5, 322, 10, 15, 0, implicit $exec
renamable $vgpr14 = V_MOV_B32_dpp undef $vgpr14(tied-def 0), $vgpr6, 322, 10, 15, 0, implicit $exec
renamable $vgpr15 = V_MOV_B32_dpp undef $vgpr15(tied-def 0), $vgpr7, 322, 10, 15, 0, implicit $exec
renamable $vgpr16 = V_MOV_B32_dpp undef $vgpr16(tied-def 0), $vgpr10, 322, 10, 15, 0, implicit $exec
renamable $vgpr17 = V_MOV_B32_dpp undef $vgpr17(tied-def 0), $vgpr11, 322, 10, 15, 0, implicit $exec
renamable $vgpr2 = V_MOV_B32_dpp undef $vgpr2(tied-def 0), $vgpr8, 322, 10, 15, 0, implicit $exec
renamable $vgpr3 = V_MOV_B32_dpp undef $vgpr3(tied-def 0), $vgpr9, 322, 10, 15, 0, implicit $exec
renamable $vgpr4 = V_MOV_B32_dpp undef $vgpr4(tied-def 0), $vgpr10, 322, 10, 15, 0, implicit $exec
renamable $vgpr5 = V_MOV_B32_dpp undef $vgpr5(tied-def 0), $vgpr11, 322, 10, 15, 0, implicit $exec
renamable $vgpr14 = V_MOV_B32_dpp undef $vgpr14(tied-def 0), $vgpr6, 322, 10, 15, 0, implicit $exec
renamable $vgpr15 = V_MOV_B32_dpp undef $vgpr15(tied-def 0), $vgpr7, 322, 10, 15, 0, implicit $exec
renamable $vgpr16 = V_MOV_B32_dpp undef $vgpr16(tied-def 0), $vgpr12, 322, 10, 15, 0, implicit $exec
renamable $vgpr17 = V_MOV_B32_dpp undef $vgpr17(tied-def 0), $vgpr13, 322, 10, 15, 0, implicit $exec
renamable $vgpr8 = V_MOV_B32_dpp undef $vgpr8(tied-def 0), $vgpr2, 323, 12, 15, 0, implicit $exec
renamable $vgpr9 = V_MOV_B32_dpp undef $vgpr9(tied-def 0), $vgpr3, 323, 12, 15, 0, implicit $exec
renamable $vgpr12 = V_MOV_B32_dpp undef $vgpr12(tied-def 0), $vgpr4, 323, 12, 15, 0, implicit $exec
renamable $vgpr13 = V_MOV_B32_dpp undef $vgpr13(tied-def 0), $vgpr5, 323, 12, 15, 0, implicit $exec
renamable $vgpr14 = V_MOV_B32_dpp undef $vgpr14(tied-def 0), $vgpr6, 323, 12, 15, 0, implicit $exec
renamable $vgpr15 = V_MOV_B32_dpp undef $vgpr15(tied-def 0), $vgpr7, 323, 12, 15, 0, implicit $exec
renamable $vgpr16 = V_MOV_B32_dpp undef $vgpr16(tied-def 0), $vgpr10, 323, 12, 15, 0, implicit $exec
renamable $vgpr17 = V_MOV_B32_dpp undef $vgpr17(tied-def 0), $vgpr11, 323, 12, 15, 0, implicit $exec
renamable $vgpr8 = V_MOV_B32_dpp undef $vgpr8(tied-def 0), $vgpr2, 322, 10, 15, 0, implicit $exec
renamable $vgpr9 = V_MOV_B32_dpp undef $vgpr9(tied-def 0), $vgpr3, 322, 10, 15, 0, implicit $exec
renamable $vgpr12 = V_MOV_B32_dpp undef $vgpr12(tied-def 0), $vgpr4, 322, 10, 15, 0, implicit $exec
renamable $vgpr13 = V_MOV_B32_dpp undef $vgpr13(tied-def 0), $vgpr5, 322, 10, 15, 0, implicit $exec
renamable $vgpr14 = V_MOV_B32_dpp undef $vgpr14(tied-def 0), $vgpr6, 322, 10, 15, 0, implicit $exec
renamable $vgpr15 = V_MOV_B32_dpp undef $vgpr15(tied-def 0), $vgpr7, 322, 10, 15, 0, implicit $exec
renamable $vgpr16 = V_MOV_B32_dpp undef $vgpr16(tied-def 0), $vgpr10, 322, 10, 15, 0, implicit $exec
renamable $vgpr17 = V_MOV_B32_dpp undef $vgpr17(tied-def 0), $vgpr11, 322, 10, 15, 0, implicit $exec
renamable $vgpr2 = V_MOV_B32_dpp undef $vgpr2(tied-def 0), $vgpr8, 322, 10, 15, 0, implicit $exec
renamable $vgpr3 = V_MOV_B32_dpp undef $vgpr3(tied-def 0), $vgpr9, 322, 10, 15, 0, implicit $exec
renamable $vgpr4 = V_MOV_B32_dpp undef $vgpr4(tied-def 0), $vgpr10, 322, 10, 15, 0, implicit $exec
renamable $vgpr5 = V_MOV_B32_dpp undef $vgpr5(tied-def 0), $vgpr11, 322, 10, 15, 0, implicit $exec
renamable $vgpr14 = V_MOV_B32_dpp undef $vgpr14(tied-def 0), $vgpr6, 322, 10, 15, 0, implicit $exec
renamable $vgpr15 = V_MOV_B32_dpp undef $vgpr15(tied-def 0), $vgpr7, 322, 10, 15, 0, implicit $exec
renamable $vgpr16 = V_MOV_B32_dpp undef $vgpr16(tied-def 0), $vgpr12, 322, 10, 15, 0, implicit $exec
renamable $vgpr17 = V_MOV_B32_dpp undef $vgpr17(tied-def 0), $vgpr13, 322, 10, 15, 0, implicit $exec
renamable $vgpr8 = V_MOV_B32_dpp undef $vgpr8(tied-def 0), $vgpr2, 323, 12, 15, 0, implicit $exec
renamable $vgpr9 = V_MOV_B32_dpp undef $vgpr9(tied-def 0), $vgpr3, 323, 12, 15, 0, implicit $exec
renamable $vgpr12 = V_MOV_B32_dpp undef $vgpr12(tied-def 0), $vgpr4, 323, 12, 15, 0, implicit $exec
renamable $vgpr13 = V_MOV_B32_dpp undef $vgpr13(tied-def 0), $vgpr5, 323, 12, 15, 0, implicit $exec
renamable $vgpr14 = V_MOV_B32_dpp undef $vgpr14(tied-def 0), $vgpr6, 323, 12, 15, 0, implicit $exec
renamable $vgpr15 = V_MOV_B32_dpp undef $vgpr15(tied-def 0), $vgpr7, 323, 12, 15, 0, implicit $exec
renamable $vgpr16 = V_MOV_B32_dpp undef $vgpr16(tied-def 0), $vgpr10, 323, 12, 15, 0, implicit $exec
renamable $vgpr17 = V_MOV_B32_dpp undef $vgpr17(tied-def 0), $vgpr11, 323, 12, 15, 0, implicit $exec
20 errors generated when compiling for gfx1010.
make[2]: *** [library/CMakeFiles/rocsparse.dir/build.make:314: library/CMakeFiles/rocsparse.dir/src/level2/rocsparse_bsrxmv_spzl_2x2.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:126: library/CMakeFiles/rocsparse.dir/all] Error 2
make: *** [Makefile:156: all] Error 2

How to reproduce

Compile rocSPARSE with -DAMDGPU_TARGETS=gfx1010. This was done using the latest release at the time of submitting this issue (ROCm 5.1.1).

Compilation error: no int32_t fma candidate function

What is the expected behavior

  • rocSPARSE compiles without issue

What actually happens

In file included from /ext4-disk/build-vanilla-clang16/portage/sci-libs/rocSPARSE-9999/work/rocSPARSE-9999/library/src/level1/rocsparse_axpyi.cpp:26:
In file included from /ext4-disk/build-vanilla-clang16/portage/sci-libs/rocSPARSE-9999/work/rocSPARSE-9999/library/src/level1/axpyi_device.h:27:
/ext4-disk/build-vanilla-clang16/portage/sci-libs/rocSPARSE-9999/work/rocSPARSE-9999/library/src/include/common.h:74:92: error: call to 'fma' is ambiguous
__device__ __forceinline__ int32_t rocsparse_fma(int32_t p, int32_t q, int32_t r) { return fma(p, q, r); }
                                                                                           ^~~
/usr/include/hip/amd_detail/amd_math_functions.h:1262:17: note: candidate function
inline _Float16 fma(_Float16 x, _Float16 y, _Float16 z) {
                ^
/usr/lib/llvm/16/bin/../../../../lib/clang/16/include/__clang_cuda_math_forward_declares.h:71:19: note: candidate function
__DEVICE__ double fma(double, double, double);
                  ^
/usr/lib/llvm/16/bin/../../../../lib/clang/16/include/__clang_cuda_math_forward_declares.h:72:18: note: candidate function
__DEVICE__ float fma(float, float, float);
                 ^
1 error generated when compiling for gfx1031.

Removing __device__ __forceinline__ int32_t rocsparse_fma(int32_t p, int32_t q, int32_t r) { return fma(p, q, r); } mitigates the issue. I checked the HIP and LLVM sources and found that there is no int32_t overload of fma.
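
One possible workaround, sketched below under the assumption that integer inputs don't need a true fused multiply-add, is to compute the integer case directly rather than relying on overload resolution to pick a floating-point fma candidate. This is only an illustration of the mitigation idea, not necessarily the fix adopted upstream.

#include <hip/hip_runtime.h>
#include <cstdint>

// For integer arguments there is no fma() overload (and no rounding to fuse),
// so a plain multiply-add yields the same result without the ambiguous call.
__device__ __forceinline__ int32_t rocsparse_fma(int32_t p, int32_t q, int32_t r)
{
    return p * q + r;
}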

Environment

Hardware description

  • GPU: gfx1031

Software version

  • ROCK: Linux-6.1
  • ROCR: a0d5e18e7752563daf4da970eae5ac8b6056a4c0
  • hipamd: b242cbcaa52e1ee9293382996c6573d2b9f3601a
  • Library: 3fa4ef3
