rocm / rocblas

Next generation BLAS implementation for ROCm platform

Home Page: https://rocm.docs.amd.com/projects/rocBLAS/en/latest/

License: Other

C 7.51% C++ 57.33% CMake 0.59% Python 1.72% Shell 21.95% Groovy 0.16% Asymptote 0.06% Fortran 10.67% Makefile 0.01%
blas rocm hip

rocblas's People

Contributors

aferoz21, alexbrownamd, amcamd, amdkila, amgddm, babakpst, bragadeesh, cgmb, daineamd, dependabot[bot], eidenyoshida, guacamoleo, leekillough, mahmoodw, nakajee, naveenelumalaiamd, pkamd, rkamd, rocmmathlibrariesbot, rosenrodt, saadrahim, sdquiring, smalekta, tingxingdong, tonyyhsieh, torrezuk, wbgilmartin, yoichiyoshida, yvanmokwinski, zaliu

rocblas's Issues

Prebuilt ROCm libraries

I don't have sudo permission. I found that installing these ROCm libraries (e.g. rocBLAS) in my home directory is painful and slow when the OS is CentOS. Could you please consider providing prebuilt libraries for users who do not have sudo permission?

Thank you

race condition of device_matrix_copy() and device_strided_batched_matrix_copy()

What is the expected behavior

  • Matrix C is supposed to be copied to D correctly by the hipMemcpy() call in device_strided_batched_matrix_copy() and device_matrix_copy().
    We should replace hipMemcpy() with the async version, hipMemcpyAsync(), to ensure that the computation and the memory copy happen in the same stream.

What actually happens

  • There is a race condition between the memory copy and the later gemm computation.

How to reproduce

Environment

Hardware description
GPU device string
CPU device string
Software version
ROCK v0.0
ROCR v0.0
HCC v0.0
Library v0.0

Wiki pages got b0rked

Dunno if github is having a brain fart or what, but the wiki pages that were around yesterday don't seem to exist anymore

Build issue error: invalid instruction v_accvgpr_read v3, a3 // 0000000010EC: D3D84003 18000103

What is the expected behavior

build finishes successfully
This has never been an issue before; ./install.sh -idc has been bulletproof.

What actually happens

One module seems to be misbehaving

/home/rocm/rocBLAS/build/release/assembly/Cijk_Alik_Bljk_BBH_MT32x32x32_SE_APM1_AF0EM8_AF1EM1_AMAS3_ASEM8_BL1_DTL0_EPS1_FL0_GRVW2_GSU1_ISA908_IU1_K1_KLA_LPA0_LPB0_LDL1_NLCA1_NLCB1_PBD0_PK0_PGR1_PLR1_RK1_SU32_SNLL0_TT2_2_USFGRO1_VAW1_VW2_WG16_16_1_WGM8.s:1088:3: error: invalid instruction 
v_accvgpr_read v3, a3 // 0000000010EC: D3D84003 18000103 
warning: ISA: (9, 0, 8)  is not supported; overriding with  (9, 0, 0)
Generating kernels: Done.
10793
Compiling source kernels: Launching 64 threads...
Compiling source kernels: Done.
clang-9: error: no such file or directory: 'Cijk_Alik_Bljk_BBH_MT32x32x32_SE_APM1_AF0EM8_AF1EM1_AMAS3_ASEM8_BL1_DTL0_EPS1_FL0_GRVW2_GSU1_ISA908_IU1_K1_KLA_LPA0_LPB0_LDL1_NLCA1_NLCB1_PBD0_PK0_PGR1_PLR1_RK1_SU32_SNLL0_TT2_2_USFGRO1_VAW1_VW2_WG16_16_1_WGM8.o'
Traceback (most recent call last):

How to reproduce

-  ./install.sh -idc

Environment

Hardware description
GPU MI25
CPU EPYC 32 core
Software version
Ubuntu 16.04.6 LTS GNU/Linux 4.15.0-43-generic x86_64
ROCK v2.7.1.2
ROCR v0.0
HCC v0.0
Library v0.0

Thanks.

rocblas_hgemm gives incorrect results

How to reproduce

  • unzip the Makefile.gz and main.cpp.gz into an empty folder
  • Run make float to run the testcase with float datatype
  • Run make half to run the testcase with half datatype

The testcase does a simple matmul of two 3x3 matrices
The testcase can be run for the

  • float datatype (rocblas_sgemm is called) (gives correct answer)
  • half datatype (rocblas_hgemm is called) (gives incorrect answer)

What is the expected behavior

make float results are the expected/correct ones
A and B are the input matrices, C is the output

root@deven-dev01:/common/hip_examples/rocblas/gemm_fp16# make float
rm -rf ./a.out
/opt/rocm/bin/hipcc -I/opt/rocm/rocblas/include -std=c++11 -lrocblas -lhip_hcc main.cpp
./a.out
hgemm example
NN: m, n, k, lda, ldb, ldc = 3, 3, 3, 3, 3, 3
A : [1, 2, 3, 4, 5, 6, 7, 8, 9, ]
B : [9, 8, 7, 6, 5, 4, 3, 2, 1, ]
C : [90, 114, 138, 54, 69, 84, 18, 24, 30, ]

What actually happens

make half results are the incorrect ones

Notice the values in the third column of C are incorrect!

root@deven-dev01:/common/hip_examples/rocblas/gemm_fp16# make half
rm -rf ./a.out
/opt/rocm/bin/hipcc -I/opt/rocm/rocblas/include -std=c++11 -lrocblas -lhip_hcc main.cpp -DTEST_HALF
./a.out
hgemm example
NN: m, n, k, lda, ldb, ldc = 3, 3, 3, 3, 3, 3
A : [1, 2, 3, 4, 5, 6, 7, 8, 9, ]
B : [9, 8, 7, 6, 5, 4, 3, 2, 1, ]
C : [90, 114, 114, 54, 69, 69, 18, 24, 24, ]

Makefile.gz
main.cpp.gz
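
The expected values of C can be cross-checked on the host without rocBLAS. The following is a minimal sketch, assuming the column-major storage and the "NN" (no transpose) case shown in the log; the function name reference_gemm_3x3 is hypothetical and is not part of the attached test case.

```cpp
#include <array>

// Host-side reference for C = A * B with column-major storage, no transposes.
// Sizes and leading dimensions match the log: m = n = k = lda = ldb = ldc = 3.
// reference_gemm_3x3 is a hypothetical helper, not part of rocBLAS.
inline std::array<float, 9> reference_gemm_3x3(const std::array<float, 9>& A,
                                               const std::array<float, 9>& B)
{
    const int m = 3, n = 3, k = 3;
    const int lda = 3, ldb = 3, ldc = 3;

    std::array<float, 9> C{};
    for(int j = 0; j < n; ++j) // column of C
    {
        for(int i = 0; i < m; ++i) // row of C
        {
            float sum = 0.0f;
            for(int p = 0; p < k; ++p)
                sum += A[i + p * lda] * B[p + j * ldb];
            C[i + j * ldc] = sum;
        }
    }
    return C;
}
```

Called with A = {1, ..., 9} and B = {9, ..., 1}, this reproduces the make float output, so the values 114, 69 and 24 that make half reports in every third position are the ones that differ from the reference.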

Issue: “Error: There is no device can be used to do the computation”, while executing MXNet tests on HIP/CUDA path after integrating rocBLAS

Issue Summary
After integrating rocBLAS into the MXNet library, we hit a runtime error: “There is no device can be used to do the computation”. (Test cases terminate after throwing this error.)
Tests are executed on HIP/CUDA (NVCC path).
If we use cuBLAS in place of rocBLAS, this issue doesn't appear.

Steps to reproduce:
1) Clone mxnet
$ git clone --recursive https://github.com/ROCmSoftwarePlatform/mxnet

2) Build mxnet
$ cd mxnet
$ export HIP_PLATFORM=nvcc
$ make -j$(nproc)

3) Install mxnet
$ cd python
$ sudo python setup.py

4) Run the test case
$ cd ../example/adversary
$ python adversary_generation.py

Error Log:
[13:07:58] src/io/iter_mnist.cc:94: MNISTIter: load 60000 images, shuffle=1, shape=(100,1,28,28)
[13:07:58] src/io/iter_mnist.cc:94: MNISTIter: load 10000 images, shuffle=1, shape=(100,1,28,28)
There is no device can be used to do the computation

Analysis: Although the error indicates that no device is present, the GPU is being detected. A screenshot of the nvidia-smi output has been attached.

rocBLAS-3.3.0 leaves undefined symbol `llvm::yaml::IO::getContext()' but does not request lib to satisfy it

What is the expected behavior

  • I can link libs with rocblas.so built from rocBLAS 3.3.0

What actually happens

  • I get an error in both rocALUTION-3.3.0 and miopen

ld: /usr/lib64/librocblas.so.0.1: undefined reference to `llvm::yaml::IO::getContext()'

How to reproduce

  • Build rocBLAS
  • Attempt to link against it by building rocALUTION or miopen

Environment

Hardware description
GPU Vega64
CPU Threadripper
Software version
ROCK v3.3.0
ROCR v3.3.0
HCC v3.3.0
Library v3.3.0

regression due to __shared__ variable for hip-clang

The following line recently caused a regression in hip-clang:
https://github.com/ROCmSoftwarePlatform/rocBLAS/blob/ed0d2b161eb6c28ed36732d10843e433d359b1bc/library/src/blas1/rocblas_amax_amin.h#L30

Error message:
[08:53:58][Step 2/2] clang: warning: argument unused during compilation: '--hip-device-lib-path=/opt/rocm/hip/lib/bitcode' [-Wunused-command-line-argument]
[08:53:59][Step 2/2] In file included from /work/rocblas/library/src/blas1/rocblas_amin.cpp:6:
[08:53:59][Step 2/2] In file included from /work/rocblas/library/src/blas1/rocblas_amax_amin.h:11:
[08:53:59][Step 2/2] /work/rocblas/library/src/blas1/reduction.h:154:19: error: initialization is not supported for shared variables.
[08:53:59][Step 2/2] __shared__ To tmp[NB];
[08:53:59][Step 2/2] ^~~
[08:53:59][Step 2/2] /work/rocblas/library/src/blas1/reduction.h:231:25: note: in instantiation of function template specialization 'rocblas_reduction_kernel_part1<1024, rocblas_fetch_amax_amin, rocblas_reduce_amin, float, index_value_t >' requested here
[08:53:59][Step 2/2] hipLaunchKernelGGL((rocblas_reduction_kernel_part1<NB, FETCH, REDUCE>),
[08:53:59][Step 2/2] ^
[08:53:59][Step 2/2] /work/rocblas/library/src/blas1/rocblas_amax_amin.h:173:19: note: in instantiation of function template specialization 'rocblas_reduction_kernel<1024, rocblas_fetch_amax_amin, rocblas_reduce_amin, rocblas_finalize_amax_amin, float, index_value_t, int>' requested here
[08:53:59][Step 2/2] auto status = rocblas_reduction_kernel<NB,
[08:53:59][Step 2/2] ^
[08:53:59][Step 2/2] /work/rocblas/library/src/blas1/rocblas_amax_amin.h:193:12: note: in instantiation of function template specialization 'rocblas_iamaxmin<float, float>' requested here
[08:53:59][Step 2/2] return rocblas_iamaxmin(handle, n, x, incx, result);

This happens because hip-clang does not allow the default constructor of a __shared__ variable to initialize data members: __shared__ variables cannot be initialized on either the amdgpu or the nvptx target. CUDA has the same restriction.

The fix is to remove initialization of data members from the default constructor and add an init() function to do the initialization, and call that function separately.
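
The shape of that fix can be sketched in plain C++. This is an illustration only, assuming a simple index/value pair; the type index_value_pair and its init() member are hypothetical names and are not the actual rocBLAS definitions.

```cpp
#include <type_traits>

// Hypothetical index/value pair used as the element type of a reduction buffer.
template <typename T>
struct index_value_pair
{
    int index;
    T   value;

    // Defaulted constructor and no default member initializers: declaring an
    // object or array of this type performs no initialization, which is what
    // a __shared__ declaration requires.
    index_value_pair() = default;

    // Initialization moved into an explicit function that the caller invokes
    // separately, for example once per thread before the reduction starts.
    void init()
    {
        index = -1;
        value = T(0);
    }
};

// The type is trivially default constructible, so no constructor code runs
// when an array of it is declared.
static_assert(std::is_trivially_default_constructible<index_value_pair<float>>::value,
              "default construction must not initialize members");
```

In a kernel, the declaration would stay of the form `__shared__ index_value_pair<float> tmp[NB];`, followed by a separate call such as `tmp[tx].init();` where `tx` is the thread index.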

Issues building 2.10 from source on Arch Linux

I am attempting to build the ROCm 2.10 stack from source on Arch Linux. I've been able to do it previously with 2.6, but 2.10 refuses to cooperate.

I had to deal with a number of problems, but I got through ROCT-ThunkInterface, ROCR-Runtime, ROCm-CompilerSupport, hcc, hip. But I'm having no luck with rocBLAS.

The first problem is that it assumes the existence of a number of ROCm executables and libraries in the wrong places. This includes, among others:
/opt/rocm/bin/hcc
/opt/rocm/lib/libhip_hcc.so
/opt/rocm/bin/lpl
None of these files are put there by "make install" commands of hip & hcc. There are /opt/rocm/hcc/bin/hcc, /opt/rocm/hip/lib/libhip_hcc.so, etc.

I do see symlinks in those locations on an Ubuntu system with a prepackaged ROCm install. I'm not sure who creates them, but their existence probably shouldn't be assumed.
It also assumes that hcc-config, hipconfig, extractkernel are in the PATH, which is unjustified.

After creating all missing symlinks I could find, I run into the following.

When building with
CXX=/opt/rocm/hcc/bin/hcc cmake ..
or
CXX=/opt/rocm/bin/hcc cmake ..

cmake fails with

CMake Error at /opt/rocm/hcc/lib/cmake/hcc/hcc-config.cmake:44 (add_library):
add_library cannot create imported target "hsa-runtime64" because another
target with the same name already exists.
Call Stack (most recent call first):
/usr/share/cmake-3.16/Modules/CMakeFindDependencyMacro.cmake:47 (find_package)
/opt/rocm/hip/lib/cmake/hip/hip-config.cmake:77 (find_dependency)
CMakeLists.txt:203 (find_package)

This is caused by /opt/rocm/hcc/lib/cmake/hcc/hcc-config.cmake being included twice. I thought of adding "include_guard(GLOBAL)" to hcc-config.cmake, but that just results in a different error:

Target "TensileHost" links to target "hcc::hccrt" but the target was not found. Perhaps a find_package() call is missing for an IMPORTED target, or an ALIAS target is missing?
Target "TensileHost" links to target "hcc::hc_am" but the target was not found. Perhaps a find_package() call is missing for an IMPORTED target, or an ALIAS target is missing?

This makes no sense to me at all, because I can see those targets in the cmake trace.

When building with

CXX=/opt/rocm/bin/hipcc cmake ..

I get

[zen@zenpc build]$ make
Scanning dependencies of target TensileHost
[ 1%] Building CXX object Tensile/lib/CMakeFiles/TensileHost.dir/source/AMDGPU.cpp.o
In file included from /home/zen/rocm/rocBLAS/build/virtualenv/lib/python3.8/site-packages/Tensile/Source/lib/source/AMDGPU.cpp:27:
In file included from /home/zen/rocm/rocBLAS/build/virtualenv/lib/python3.8/site-packages/Tensile/Source/lib/include/Tensile/AMDGPU.hpp:29:
/home/zen/rocm/rocBLAS/build/virtualenv/lib/python3.8/site-packages/Tensile/Source/lib/include/Tensile/Tensile.hpp:55:9: error: virtual member function is not supported
virtual HIPCC_BUILD ~ProblemInputs();
^
1 error generated.
In file included from /home/zen/rocm/rocBLAS/build/virtualenv/lib/python3.8/site-packages/Tensile/Source/lib/source/AMDGPU.cpp:27:
In file included from /home/zen/rocm/rocBLAS/build/virtualenv/lib/python3.8/site-packages/Tensile/Source/lib/include/Tensile/AMDGPU.hpp:29:
/home/zen/rocm/rocBLAS/build/virtualenv/lib/python3.8/site-packages/Tensile/Source/lib/include/Tensile/Tensile.hpp:55:9: error: virtual member function is not supported
virtual HIPCC_BUILD ~ProblemInputs();
^
1 error generated.
make[2]: *** [Tensile/lib/CMakeFiles/TensileHost.dir/build.make:63: Tensile/lib/CMakeFiles/TensileHost.dir/source/AMDGPU.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:154: Tensile/lib/CMakeFiles/TensileHost.dir/all] Error 2

Finally, when I try to use the provided install.sh script, after hacking it not to throw an error because it does not know how to install prerequisite packages (I've installed them all manually), I get

Compiling source kernels: Launching 16 threads...
Unrecognised token: -D__HIP_HCC_COMPAT_MODE__
Died at /opt/rocm/bin/hipcc line 360.
Traceback (most recent call last):
File "/home/zen/rocm/rocBLAS/build/release/virtualenv/lib/python3.8/site-packages/Tensile/Common.py", line 1194, in apply_print_exception
return func(*args)
File "/home/zen/rocm/rocBLAS/build/release/virtualenv/lib/python3.8/site-packages/Tensile/TensileCreateLibrary.py", line 165, in buildSourceCodeObjectFile
subprocess.check_call(compileArgs)
File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/opt/rocm/bin/hipcc', '--genco', '-D__HIP_HCC_COMPAT_MODE__=1', '-I', '/home/zen/rocm/rocBLAS/build/release/Tensile', '--amdgpu-target=gfx803', '--amdgpu-target=gfx900', '--amdgpu-target=gfx906', '--amdgpu-target=gfx908', '/home/zen/rocm/rocBLAS/build/release/Tensile/Kernels.cpp', '-c', '-o', '/home/zen/rocm/rocBLAS/build/release/code_object_tmp/Kernels.so']' returned non-zero exit status 2.

This is because Tensile here https://github.com/ROCmSoftwarePlatform/Tensile/blame/develop/Tensile/TensileCreateLibrary.py#L158 tries to launch hipcc with the command line options "--genco -D__HIP_HCC_COMPAT_MODE__=1", and these options are invalid (you can run "/opt/rocm/bin/hipcc --genco -D__HIP_HCC_COMPAT_MODE__=1" and no further options and it'll bail with the exact same error).

After puzzling over this, I tried to see if I can even build standalone Tensile, and I can't:

[zen@zenpc build]$ CXX=/opt/rocm/bin/hipcc cmake ..
-- The C compiler identification is GNU 9.2.0
-- The CXX compiler identification is Clang 10.0.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /opt/rocm/bin/hipcc
-- Check for working CXX compiler: /opt/rocm/bin/hipcc -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Processing hcc-config.cmake
-- Found PkgConfig: /usr/bin/pkg-config (found version "1.6.3")
-- Found OpenMP_C: -fopenmp (found version "4.5")
-- Found OpenMP_CXX: -fopenmp=libomp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
-- Could NOT find Boost (missing: Boost_INCLUDE_DIR program_options)
-- Found ROCmSMI: /opt/rocm/rocm_smi/lib/librocm_smi64.so
-- Configuring done
-- Generating done
-- Build files have been written to: /home/zen/rocm/Tensile/Tensile/Source/build
[zen@zenpc build]$ make
Scanning dependencies of target TensileHost
[ 3%] Building CXX object lib/CMakeFiles/TensileHost.dir/source/AMDGPU.cpp.o
In file included from /home/zen/rocm/Tensile/Tensile/Source/lib/source/AMDGPU.cpp:27:
In file included from /home/zen/rocm/Tensile/Tensile/Source/lib/include/Tensile/AMDGPU.hpp:29:
/home/zen/rocm/Tensile/Tensile/Source/lib/include/Tensile/Tensile.hpp:55:9: error: virtual member function is not supported
virtual HIPCC_BUILD ~ProblemInputs();
^
1 error generated.

[zen@zenpc build]$ CXX=/opt/rocm/bin/hcc cmake ..
-- The C compiler identification is GNU 9.2.0
-- The CXX compiler identification is Clang 10.0.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /opt/rocm/bin/hcc
-- Check for working CXX compiler: /opt/rocm/bin/hcc -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Processing hcc-config.cmake
-- Found PkgConfig: /usr/bin/pkg-config (found version "1.6.3")
-- Found OpenMP_C: -fopenmp (found version "4.5")
-- Found OpenMP_CXX: -fopenmp=libomp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
-- Could NOT find Boost (missing: Boost_INCLUDE_DIR program_options)
-- Found ROCmSMI: /opt/rocm/rocm_smi/lib/librocm_smi64.so
-- Configuring done
-- Generating done
-- Build files have been written to: /home/zen/rocm/Tensile/Tensile/Source/build
[zen@zenpc build]$ make
Scanning dependencies of target TensileHost
[ 3%] Building CXX object lib/CMakeFiles/TensileHost.dir/source/AMDGPU.cpp.o
[ 6%] Building CXX object lib/CMakeFiles/TensileHost.dir/source/ContractionProblem.cpp.o
[ 9%] Building CXX object lib/CMakeFiles/TensileHost.dir/source/ContractionSolution.cpp.o
[ 12%] Building CXX object lib/CMakeFiles/TensileHost.dir/source/DataTypes.cpp.o
[ 16%] Building CXX object lib/CMakeFiles/TensileHost.dir/source/Debug.cpp.o
[ 19%] Building CXX object lib/CMakeFiles/TensileHost.dir/source/EmbeddedLibrary.cpp.o
In file included from /home/zen/rocm/Tensile/Tensile/Source/lib/source/EmbeddedLibrary.cpp:29:
In file included from /home/zen/rocm/Tensile/Tensile/Source/lib/include/Tensile/EmbeddedLibrary.hpp:32:
/home/zen/rocm/Tensile/Tensile/Source/lib/include/Tensile/Singleton.hpp:39:26: error: call to implicitly-deleted default constructor of 'Tensile::EmbeddedData<Tensile::SolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution> >'
static Class instance;
^
/home/zen/rocm/Tensile/Tensile/Source/lib/include/Tensile/EmbeddedData.hpp:62:39: note: in instantiation of member function 'Tensile::LazySingleton<Tensile::EmbeddedData<Tensile::SolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution> > >::Instance' requested here
auto const& items = Base::Instance().items;
^
/home/zen/rocm/Tensile/Tensile/Source/lib/source/EmbeddedLibrary.cpp:42:82: note: in instantiation of member function 'Tensile::EmbeddedData<Tensile::SolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution> >::Get' requested here
auto const& data = EmbeddedData<SolutionLibrary<MyProblem, MySolution>>::Get(key);
^
/home/zen/rocm/Tensile/Tensile/Source/lib/include/Tensile/EmbeddedData.hpp:84:9: note: explicitly defaulted function was implicitly deleted here
EmbeddedData() = default;
^
/home/zen/rocm/Tensile/Tensile/Source/lib/include/Tensile/EmbeddedData.hpp:87:21: note: default constructor of 'EmbeddedData<Tensile::SolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution> >' is implicitly deleted because field 'empty' of const-qualified type 'const Tensile::EmbeddedData::Items' (aka 'const vector<vector<unsigned char> >') would not be initialized
const Items empty;
^
1 error generated.

Any suggestions (apart from sticking to Ubuntu)?

Using cmake find_package(rocblas) does not report the library

What is the expected behavior

  • Using the CMakeLists.txt:
    cmake_minimum_required(VERSION 3.9)
    find_package( rocblas )
    message( ${rocblas_INCLUDE_DIRS} )
    message( ${rocblas_LIBRARIES} )

should report include directories and libraries

What actually happens

  • "cmake" gives the following output:

/usr/rocblas/include
rocblas-targets

How to reproduce

  • Run a cmake configuration with find_package(rocblas)
Software version
ROCK v1.9
ROCR v1.9
HCC v1.9
rocBLAS development branch

Unable to build clients

I am unable to build clients on Ubuntu 18.04 (gcc 7.4.0), with either the rocm-3.0 tag or the head of develop.

Building with install.sh -c fails due to the error from #845, and I'm unable to patch the source of Tensile to get around it because install.sh just wipes everything and checks out the code from scratch.

I tried to build clients only by hand. The process is not clearly documented but I did the following.

  • Create clients/build
  • Change into clients/build
  • Run cmake -DBUILD_CLIENTS_BENCHMARKS=ON ..
  • Run make

This fails too. If I run cmake as above, with no additional environment variables, it fails with a ton of errors starting with

/home/zen/devel/rocBLAS/clients/benchmarks/../include/rocblas.hpp:36:1: error: explicit template specialization cannot have a storage class
static constexpr auto rocblas_scal = rocblas_sscal;
^~~~~~

If, instead, I do
CXX=/opt/rocm/bin/hcc cmake -DBUILD_CLIENTS_BENCHMARKS=ON ..
or
CXX=/opt/rocm/bin/hipcc cmake -DBUILD_CLIENTS_BENCHMARKS=ON ..

it fails with

/home/zen/devel/rocBLAS/clients/benchmarks/../include/rocblas_datatype2string.hpp:21:11: error: 'auto' return without trailing return type; deduced return types are a C++14 extension
constexpr auto rocblas2char_operation(rocblas_operation value)
^

I figured out how to force rocBLAS to use C++14, but that just kicks the can down the road a bit, since then it fails with

[ 50%] Building CXX object benchmarks/CMakeFiles/rocblas-bench.dir/client.cpp.o
In file included from /home/zen/devel/rocBLAS/clients/benchmarks/client.cpp:7:
In file included from /home/zen/devel/rocBLAS/clients/benchmarks/../include/rocblas_data.hpp:8:
In file included from /home/zen/devel/rocBLAS/clients/benchmarks/../include/rocblas_arguments.hpp:10:
/home/zen/devel/rocBLAS/clients/benchmarks/../include/rocblas_math.hpp:30:48: error: use of undeclared identifier 'is_complex'; did you mean '_Complex'?
template <typename T, typename std::enable_if<!is_complex, int>::type = 0>
^~~~~~~~~~
_Complex
/home/zen/devel/rocBLAS/clients/benchmarks/../include/rocblas_math.hpp:30:48: error: expected unqualified-id
/home/zen/devel/rocBLAS/clients/benchmarks/../include/rocblas_math.hpp:30:69: error: parameter declarator cannot be qualified
template <typename T, typename std::enable_if<!is_complex, int>::type = 0>
~~^

What's the correct build process?
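
For reference, the "deduced return types are a C++14 extension" error quoted above is the compiler rejecting a `constexpr auto` function without a trailing return type in C++11 mode. The sketch below shows a C++11-compatible shape of such a function; the enum example_operation and the function op_to_char are hypothetical stand-ins, not the rocBLAS definitions, and this is not a statement about how the project intends the clients to be built.

```cpp
// Hypothetical stand-in for an operation enum such as rocblas_operation.
enum example_operation
{
    example_operation_none,
    example_operation_transpose,
    example_operation_conjugate_transpose
};

// C++14 would allow "constexpr auto op_to_char(example_operation value) { ... }".
// In C++11 the return type must be written out, and a constexpr function body
// is limited to a single return statement, hence the chained conditional.
constexpr char op_to_char(example_operation value)
{
    return value == example_operation_none                  ? 'N'
           : value == example_operation_transpose           ? 'T'
           : value == example_operation_conjugate_transpose ? 'C'
                                                            : '?';
}

static_assert(op_to_char(example_operation_transpose) == 'T',
              "usable in constant expressions");
```

Code written with deduced return types needs at least -std=c++14, which matches the observation above that forcing C++14 gets past this particular error.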

Building rocBLAS 2.10 from source - error: call to implicitly-deleted default constructor

What is the expected behavior

  • Successful build of rocBLAS

What actually happens

  • Build fails with the attached error.

How to reproduce

Environment

Hardware description
GPU Radeon RX 560
CPU Intel Core i5
Software version
ROCK v2.10
ROCR v2.10
HCC v2.10
Library v2.10

The build error:

[3/81] /usr/lib/hcc/2.10/bin/hcc -DTENSILE_DEFAULT_SERIALIZATION -DTENSILE_USE_HIP -Ivirtualenv/lib/python3.5/site-packages/Tensile/Source/lib/include -I/usr/lib/llvm/6/include -isystem /usr/lib/hip/include -isystem /usr/lib/hcc/2.10/include  -DNDEBUG --amdgpu-target=gfx803 -hc -fopenmp=libomp   -hc -fPIC -std=c++14 -MD -MT Tensile/lib/CMakeFiles/TensileHost.dir/source/EmbeddedLibrary.cpp.o -MF Tensile/lib/CMakeFiles/TensileHost.dir/source/EmbeddedLibrary.cpp.o.d -o Tensile/lib/CMakeFiles/TensileHost.dir/source/EmbeddedLibrary.cpp.o -c virtualenv/lib/python3.5/site-packages/Tensile/Source/lib/source/EmbeddedLibrary.cpp
FAILED: Tensile/lib/CMakeFiles/TensileHost.dir/source/EmbeddedLibrary.cpp.o 
/usr/lib/hcc/2.10/bin/hcc -DTENSILE_DEFAULT_SERIALIZATION -DTENSILE_USE_HIP -Ivirtualenv/lib/python3.5/site-packages/Tensile/Source/lib/include -I/usr/lib/llvm/6/include -isystem /usr/lib/hip/include -isystem /usr/lib/hcc/2.10/include  -DNDEBUG --amdgpu-target=gfx803 -hc -fopenmp=libomp   -hc -fPIC -std=c++14 -MD -MT Tensile/lib/CMakeFiles/TensileHost.dir/source/EmbeddedLibrary.cpp.o -MF Tensile/lib/CMakeFiles/TensileHost.dir/source/EmbeddedLibrary.cpp.o.d -o Tensile/lib/CMakeFiles/TensileHost.dir/source/EmbeddedLibrary.cpp.o -c virtualenv/lib/python3.5/site-packages/Tensile/Source/lib/source/EmbeddedLibrary.cpp
clang-10: warning: argument unused during compilation: '--amdgpu-target=gfx803' [-Wunused-command-line-argument]
In file included from virtualenv/lib/python3.5/site-packages/Tensile/Source/lib/source/EmbeddedLibrary.cpp:29:
In file included from virtualenv/lib/python3.5/site-packages/Tensile/Source/lib/include/Tensile/EmbeddedLibrary.hpp:32:
virtualenv/lib/python3.5/site-packages/Tensile/Source/lib/include/Tensile/Singleton.hpp:39:26: error: call to implicitly-deleted default constructor of 'Tensile::EmbeddedData<Tensile::SolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution> >'
            static Class instance;
                         ^
virtualenv/lib/python3.5/site-packages/Tensile/Source/lib/include/Tensile/EmbeddedData.hpp:62:39: note: in instantiation of member function 'Tensile::LazySingleton<Tensile::EmbeddedData<Tensile::SolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution> > >::Instance' requested here
            auto const& items = Base::Instance().items;
                                      ^
virtualenv/lib/python3.5/site-packages/Tensile/Source/lib/source/EmbeddedLibrary.cpp:42:82: note: in instantiation of member function 'Tensile::EmbeddedData<Tensile::SolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution> >::Get' requested here
        auto const& data = EmbeddedData<SolutionLibrary<MyProblem, MySolution>>::Get(key);
                                                                                 ^
virtualenv/lib/python3.5/site-packages/Tensile/Source/lib/include/Tensile/EmbeddedData.hpp:84:9: note: explicitly defaulted function was implicitly deleted here
        EmbeddedData() = default;
        ^
virtualenv/lib/python3.5/site-packages/Tensile/Source/lib/include/Tensile/EmbeddedData.hpp:87:21: note: default constructor of 'EmbeddedData<Tensile::SolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution> >' is implicitly deleted because field 'empty' of const-qualified type 'const Tensile::EmbeddedData::Items' (aka 'const vector<vector<unsigned char> >') would not be initialized
        const Items empty;
                    ^
1 error generated.

I changed

const Items empty;

in "Tensile/Source/lib/include/Tensile/EmbeddedData.hpp" to:

const Items empty = {};

then it compiles. Is it an acceptable solution?
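
A reduced version of the pattern, assuming only what the error message shows (a const member of type vector<vector<unsigned char>> and a defaulted default constructor), looks like the sketch below. The struct name embedded_data_example is hypothetical; whether the maintainers consider this change acceptable is a separate question.

```cpp
#include <type_traits>
#include <vector>

// Hypothetical reduction of the class named in the error message.
struct embedded_data_example
{
    using Items = std::vector<std::vector<unsigned char>>;

    embedded_data_example() = default;

    // The "= {}" default member initializer gives the const member an explicit
    // initial value, so the defaulted constructor no longer depends on whether
    // the compiler treats a const vector as default-initializable.
    const Items empty = {};
};

static_assert(std::is_default_constructible<embedded_data_example>::value,
              "the defaulted constructor is usable");
```

With the initializer in place, `empty` is still an empty container, so code that only reads it should see no behavioral difference.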

Build fails with -lpthread

Hi,

I am trying to build rocBLAS following the build procedure. One minor exception: I am on Arch Linux, so I had to make some adaptations, but the general steps are the same.

What actually happens

I get several screens of error messages:

warning: ISA: (8, 0, 3)  is not supported; overriding with  (9, 0, 0)
warning: ISA: (8, 0, 3)  is not supported; overriding with  (9, 0, 0)
warning: ISA: (8, 0, 3)  is not supported; overriding with  (9, 0, 0)
warning: ISA: (8, 0, 3)  is not supported; overriding with  (9, 0, 0)
warning: ISA: (8, 0, 3)  is not supported; overriding with  (9, 0, 0)
warning: ISA/home/me/rocBLAS-git/src/rocBLAS/build/assembly/asm.sh: line 7: None : command not found
/home/me/rocBLAS-git/src/rocBLAS/build/assembly/asm.sh: line 8: None : command not found
/home/me/rocBLAS-git/src/rocBLAS/build/assembly/asm.sh: line 7: None : command not found
/home/me/rocBLAS-git/src/rocBLAS/build/assembly/asm.sh: line 8: None : command not found
/home/me/rocBLAS-git/src/rocBLAS/build/assembly/asm.sh: line 7: None : command not found
/home/me/rocBLAS-git/src/rocBLAS/build/assembly/asm.sh: line 8: None : command not found
/home/me/rocBLAS-git/src/rocBLAS/build/assembly/asm.sh: line 7: None : command not found

I truncated the output, but it is basically this, repeated for hundreds of lines.
As far as I understand, the error occurs because asm.sh refers to the variable ASM, which is not resolved. What should it be? I guess it has to be defined with an export right before the cmake call?

The second problem, which is the blocking one, is the lpthread error. I get the following:

-- Boost version: 1.68.0
-- Found the following Boost libraries:
--   program_options
-- Configuring incomplete, errors occurred!
See also "/home/me/rocBLAS-git/src/rocBLAS/CMakeFiles/CMakeOutput.log".
See also "/home/me/rocBLAS-git/src/rocBLAS/CMakeFiles/CMakeError.log".

And contained in CMakeError.log:

Determining if the pthread_create exist failed with the following output:
Change Dir: /home/me/rocBLAS-git/src/rocBLAS/CMakeFiles/CMakeTmp

Run Build Command:"/usr/bin/make" "cmTC_27287/fast"
/usr/bin/make -f CMakeFiles/cmTC_27287.dir/build.make CMakeFiles/cmTC_27287.dir/build
make[1]: Entering directory '/home/me/rocBLAS-git/src/rocBLAS/CMakeFiles/CMakeTmp'
Building CXX object CMakeFiles/cmTC_27287.dir/CheckSymbolExists.cxx.o
/usr/bin/c++     -o CMakeFiles/cmTC_27287.dir/CheckSymbolExists.cxx.o -c /home/me/rocBLAS-git/src/rocBLAS/CMakeFiles/CMakeTmp/CheckSymbolExists.cxx
Linking CXX executable cmTC_27287
/usr/bin/cmake -E cmake_link_script CMakeFiles/cmTC_27287.dir/link.txt --verbose=1
/usr/bin/c++       CMakeFiles/cmTC_27287.dir/CheckSymbolExists.cxx.o  -o cmTC_27287 
/usr/bin/ld: CMakeFiles/cmTC_27287.dir/CheckSymbolExists.cxx.o: in function `main':
CheckSymbolExists.cxx:(.text+0x1b): undefined reference to `pthread_create'
collect2: error: ld returned 1 exit status
make[1]: *** [CMakeFiles/cmTC_27287.dir/build.make:87: cmTC_27287] Error 1
make[1]: Leaving directory '/home/me/rocBLAS-git/src/rocBLAS/CMakeFiles/CMakeTmp'
make: *** [Makefile:121: cmTC_27287/fast] Error 2

File /home/me/rocBLAS-git/src/rocBLAS/CMakeFiles/CMakeTmp/CheckSymbolExists.cxx:
/* */
#include <pthread.h>

int main(int argc, char** argv)
{
  (void)argv;
#ifndef pthread_create
  return ((int*)(&pthread_create))[argc];
#else
  (void)argc;
  return 0;
#endif
}

Determining if the function pthread_create exists in the pthreads failed with the following output:
Change Dir: /home/me/rocBLAS-git/src/rocBLAS/CMakeFiles/CMakeTmp

Run Build Command:"/usr/bin/make" "cmTC_98f38/fast"
/usr/bin/make -f CMakeFiles/cmTC_98f38.dir/build.make CMakeFiles/cmTC_98f38.dir/build
make[1]: Entering directory '/home/me/rocBLAS-git/src/rocBLAS/CMakeFiles/CMakeTmp'
Building CXX object CMakeFiles/cmTC_98f38.dir/CheckFunctionExists.cxx.o
/usr/bin/c++    -DCHECK_FUNCTION_EXISTS=pthread_create   -o CMakeFiles/cmTC_98f38.dir/CheckFunctionExists.cxx.o -c /home/me/rocBLAS-git/src/rocBLAS/CMakeFiles/CheckLibraryExists/CheckFunctionExists.cxx
Linking CXX executable cmTC_98f38
/usr/bin/cmake -E cmake_link_script CMakeFiles/cmTC_98f38.dir/link.txt --verbose=1
/usr/bin/c++   -DCHECK_FUNCTION_EXISTS=pthread_create    CMakeFiles/cmTC_98f38.dir/CheckFunctionExists.cxx.o  -o cmTC_98f38 -lpthreads 
/usr/bin/ld: cannot find -lpthreads
collect2: error: ld returned 1 exit status
make[1]: *** [CMakeFiles/cmTC_98f38.dir/build.make:87: cmTC_98f38] Error 1
make[1]: Leaving directory '/home/me/rocBLAS-git/src/rocBLAS/CMakeFiles/CMakeTmp'
make: *** [Makefile:121: cmTC_98f38/fast] Error 2

So I tried adding the following options to cmake: "-DCMAKE_CXX_FLAGS=-lpthread -DCMAKE_EXE_LINKER_FLAGS=-lpthread". So far, no luck.
Any ideas?

Environment

Arch Linux

Hardware description
GPU AMD Radeon R8 M445DX
CPU AMD FX-9800P

Thank you for your help!
Cheers.

ship rocblas-bench binary by default?

This is not an issue but an inquiry so I'm skipping the template.

I'm wondering whether it would be possible to ship rocblas-bench by default, so that users working on applications or frameworks can use the ROCBLAS_LAYER and TENSILE_DB environment variables to quickly check whether a configuration has been tuned and report back, without rebuilding rocBLAS with a potentially incorrect YAML configuration for Tensile.

Fallback operation on RX Vega 64

I have a Radeon RX Vega 64 on Ubuntu 18.04.3 LTS (recommended ROCm OS).


$ lspci -nn
0e:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XT [Radeon RX Vega 64] [1002:687f] (rev c1)
$ /opt/rocm/bin/rocm-smi --showdriverversion
Driver version: 5.0.71
$ /opt/rocm/bin/rocm-smi --showproductname
GPU[0] 		: Card series:		Vega 10 XT [Radeon RX Vega 64]
GPU[0] 		: Card vendor:		Advanced Micro Devices, Inc. [AMD/ATI]
GPU[0] 		: Card SKU:		D05001

All rocBLAS operations are ignoring the optimized path (vega10 Tensile kernels) and using the fallback path.

It seems to be happening because rocBLAS (via Tensile) selects the kernel by comparing the device-name string returned by hipGetDeviceProperties() against the list of "registered" device-names, which, for the Vega 10 family, includes Device 6863, Device 6862, Device 687f, Device 6860, Device 6861, 'Vega 10 XTX [Radeon Vega Frontier Edition]', 'Vega [Radeon RX Vega]' (https://github.com/ROCmSoftwarePlatform/rocBLAS/blob/develop/library/src/blas3/Tensile/Logic/asm_full/vega10_Cijk_Alik_Bljk_SB.yaml#L4), but not Vega 10 XT [Radeon RX Vega 64].

My card is, presumably, supposed to be covered by "Device 687f", but that's not the string rocBLAS is getting from the system.
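The mismatch described above can be illustrated with a minimal sketch. It assumes the selection is a plain exact-string comparison, as the issue suggests; `select_path` is a hypothetical helper and not Tensile's actual code. The registered names are copied from the list quoted above.

```python
# Device names copied from the vega10 list quoted in the issue.
REGISTERED_VEGA10_NAMES = [
    "Device 6863",
    "Device 6862",
    "Device 687f",
    "Device 6860",
    "Device 6861",
    "Vega 10 XTX [Radeon Vega Frontier Edition]",
    "Vega [Radeon RX Vega]",
]


def select_path(device_name, registered=REGISTERED_VEGA10_NAMES):
    """Hypothetical selector: exact string match, otherwise fallback."""
    return "vega10" if device_name in registered else "fallback"


if __name__ == "__main__":
    # The raw PCI-ID style name is registered...
    print(select_path("Device 687f"))
    # ...but the marketing name reported on this system is not.
    print(select_path("Vega 10 XT [Radeon RX Vega 64]"))
```

Under that assumption, any string that differs from a registered name, even for the same PCI device, ends up on the fallback path.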

Build fails on 2.0.0, works on develop

What is the expected behavior

Build works fine

What actually happens

Build fails with this error when linking rocblas.so.2.0.0.0

ninja -v    
[1/2  50%(33.607)] : && /opt/rocm/hcc/bin/hcc -fPIC -march=native -O2 -pipe -fstack-protector-strong -fno-plt -O3 -DNDEBUG  -Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -shared -Wl,-soname,librocblas.so.0 -o library/src/librocblas.so.2.0.0.0 library/src/CMakeFiles/rocblas.dir/blas_ex/rocblas_gemm_ex.cpp.o library/src/CMakeFiles/rocblas.dir/blas3/rocblas_trtri.cpp.o library/src/CMakeFiles/rocblas.dir/blas3/rocblas_trtri_batched.cpp.o library/src/CMakeFiles/rocblas.dir/blas3/rocblas_geam.cpp.o library/src/CMakeFiles/rocblas.dir/blas3/Tensile/gemm.cpp.o library/src/CMakeFiles/rocblas.dir/blas3/rocblas_trsm.cpp.o library/src/CMakeFiles/rocblas.dir/blas2/rocblas_gemv.cpp.o library/src/CMakeFiles/rocblas.dir/blas2/rocblas_ger.cpp.o library/src/CMakeFiles/rocblas.dir/blas2/rocblas_syr.cpp.o library/src/CMakeFiles/rocblas.dir/blas1/fetch_template.cpp.o library/src/CMakeFiles/rocblas.dir/blas1/rocblas_amin.cpp.o library/src/CMakeFiles/rocblas.dir/blas1/rocblas_amax.cpp.o library/src/CMakeFiles/rocblas.dir/blas1/rocblas_asum.cpp.o library/src/CMakeFiles/rocblas.dir/blas1/rocblas_axpy.cpp.o library/src/CMakeFiles/rocblas.dir/blas1/rocblas_copy.cpp.o library/src/CMakeFiles/rocblas.dir/blas1/rocblas_dot.cpp.o library/src/CMakeFiles/rocblas.dir/blas1/rocblas_nrm2.cpp.o library/src/CMakeFiles/rocblas.dir/blas1/rocblas_scal.cpp.o library/src/CMakeFiles/rocblas.dir/blas1/rocblas_swap.cpp.o library/src/CMakeFiles/rocblas.dir/handle.cpp.o library/src/CMakeFiles/rocblas.dir/utility.cpp.o library/src/CMakeFiles/rocblas.dir/rocblas_auxiliary.cpp.o library/src/CMakeFiles/rocblas.dir/status.cpp.o library/src/CMakeFiles/rocblas.dir/buildinfo.cpp.o  -Wl,-rpath,/opt/rocm/hip/lib:/opt/rocm/lib: --amdgpu-target=gfx803 --amdgpu-target=gfx900 --amdgpu-target=gfx906 library/src/libtensile-rocblas.a /opt/rocm/hip/lib/libhip_hcc.so -hc -L /opt/rocm/hcc/lib -Wl,-rpath /opt/rocm/hcc/lib -Wl,--whole-archive /opt/rocm/hcc/lib/libmcwamp.so -Wl,--no-whole-archive -ldl -lm 
/opt/rocm/hcc/lib/libhc_am.so /opt/rocm/lib/libhsa-runtime64.so --amdgpu-target=gfx803 --amdgpu-target=gfx900 --amdgpu-target=gfx906 && :
FAILED: library/src/librocblas.so.2.0.0.0 
: && /opt/rocm/hcc/bin/hcc -fPIC -march=native -O2 -pipe -fstack-protector-strong -fno-plt -O3 -DNDEBUG  -Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -shared -Wl,-soname,librocblas.so.0 -o library/src/librocblas.so.2.0.0.0 library/src/CMakeFiles/rocblas.dir/blas_ex/rocblas_gemm_ex.cpp.o library/src/CMakeFiles/rocblas.dir/blas3/rocblas_trtri.cpp.o library/src/CMakeFiles/rocblas.dir/blas3/rocblas_trtri_batched.cpp.o library/src/CMakeFiles/rocblas.dir/blas3/rocblas_geam.cpp.o library/src/CMakeFiles/rocblas.dir/blas3/Tensile/gemm.cpp.o library/src/CMakeFiles/rocblas.dir/blas3/rocblas_trsm.cpp.o library/src/CMakeFiles/rocblas.dir/blas2/rocblas_gemv.cpp.o library/src/CMakeFiles/rocblas.dir/blas2/rocblas_ger.cpp.o library/src/CMakeFiles/rocblas.dir/blas2/rocblas_syr.cpp.o library/src/CMakeFiles/rocblas.dir/blas1/fetch_template.cpp.o library/src/CMakeFiles/rocblas.dir/blas1/rocblas_amin.cpp.o library/src/CMakeFiles/rocblas.dir/blas1/rocblas_amax.cpp.o library/src/CMakeFiles/rocblas.dir/blas1/rocblas_asum.cpp.o library/src/CMakeFiles/rocblas.dir/blas1/rocblas_axpy.cpp.o library/src/CMakeFiles/rocblas.dir/blas1/rocblas_copy.cpp.o library/src/CMakeFiles/rocblas.dir/blas1/rocblas_dot.cpp.o library/src/CMakeFiles/rocblas.dir/blas1/rocblas_nrm2.cpp.o library/src/CMakeFiles/rocblas.dir/blas1/rocblas_scal.cpp.o library/src/CMakeFiles/rocblas.dir/blas1/rocblas_swap.cpp.o library/src/CMakeFiles/rocblas.dir/handle.cpp.o library/src/CMakeFiles/rocblas.dir/utility.cpp.o library/src/CMakeFiles/rocblas.dir/rocblas_auxiliary.cpp.o library/src/CMakeFiles/rocblas.dir/status.cpp.o library/src/CMakeFiles/rocblas.dir/buildinfo.cpp.o  -Wl,-rpath,/opt/rocm/hip/lib:/opt/rocm/lib: --amdgpu-target=gfx803 --amdgpu-target=gfx900 --amdgpu-target=gfx906 library/src/libtensile-rocblas.a /opt/rocm/hip/lib/libhip_hcc.so -hc -L /opt/rocm/hcc/lib -Wl,-rpath /opt/rocm/hcc/lib -Wl,--whole-archive /opt/rocm/hcc/lib/libmcwamp.so -Wl,--no-whole-archive -ldl -lm /opt/rocm/hcc/lib/libhc_am.so 
/opt/rocm/lib/libhsa-runtime64.so --amdgpu-target=gfx803 --amdgpu-target=gfx900 --amdgpu-target=gfx906 && :
Call parameter type does not match function signature!
  %StackGuardSlot = alloca i8*, addrspace(5)
 i8**  call void @llvm.stackprotector(i8* %0, i8* addrspace(5)* %StackGuardSlot)
in function _ZN5trtri16gemm_trsm_kernelIdLi64EEEviiPKT_iS3_iS3_iPS1_i
LLVM ERROR: Broken function found, compilation aborted!
Generating AMD GCN kernel failed in llc for target: gfx900
Call parameter type does not match function signature!
  %StackGuardSlot = alloca i8*, addrspace(5)
 i8**  call void @llvm.stackprotector(i8* %0, i8* addrspace(5)* %StackGuardSlot)
in function _ZN5trtri16gemm_trsm_kernelIdLi64EEEviiPKT_iS3_iS3_iPS1_i
LLVM ERROR: Broken function found, compilation aborted!
Generating AMD GCN kernel failed in llc for target: gfx803
Call parameter type does not match function signature!
  %StackGuardSlot = alloca i8*, addrspace(5)
 i8**  call void @llvm.stackprotector(i8* %0, i8* addrspace(5)* %StackGuardSlot)
in function _ZN5trtri16gemm_trsm_kernelIdLi64EEEviiPKT_iS3_iS3_iPS1_i
LLVM ERROR: Broken function found, compilation aborted!
Generating AMD GCN kernel failed in llc for target: gfx906
clang-8: error: linker command failed with exit code 7 (use -v to see invocation)
ninja: build stopped: subcommand failed.

How to reproduce

I am packaging ROCm components for Arch Linux, so if I made a configuration mistake there, this might be difficult to reproduce.

Environment

Hardware description
GPU RX Vega 64
CPU TR 1950X
Software version
ROCK upstream 4.19.10
ROCR 2.0.0
HCC 2.0.0
Library 2.0.0

trtri API not LAPACK compatible

What is the expected behavior

  • LAPACK's trtri API is
subroutine dtrtri ( character UPLO, character DIAG, integer N, double precision, dimension( lda, * ) A, integer LDA, integer INFO) 

What actually happens

  • rocBLAS API is
ROCBLAS_EXPORT rocblas_status rocblas_dtrtri(rocblas_handle handle,
                                             rocblas_fill uplo,
                                             rocblas_diagonal diag,
                                             rocblas_int n,
                                             const double* A,
                                             rocblas_int lda,
                                             double* invA,
                                             rocblas_int ldinvA);

Note the difference in arguments: A only (in place, LAPACK) versus A plus invA (out of place, rocBLAS).

Question

I am currently looking into implementing getri (which needs trtri), and I would like to know whether the trtri API is going to stay like this, so that I can code around it if necessary. Thanks!
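One way to code around the difference is a thin wrapper that calls the out-of-place routine into a scratch buffer and copies the triangle back. The sketch below shows the idea in pure Python for a lower-triangular, non-unit-diagonal matrix stored as a list of rows. Both function names are hypothetical; this is neither rocBLAS nor LAPACK code.

```python
def trtri_lower_out_of_place(a):
    """Return the inverse of lower-triangular `a` in a new matrix (rocBLAS style)."""
    n = len(a)
    inv = [[0.0] * n for _ in range(n)]
    for i in range(n):
        inv[i][i] = 1.0 / a[i][i]
        for j in range(i):
            # Forward substitution: only entries a[i][k] with k <= i are read.
            s = sum(a[i][k] * inv[k][j] for k in range(j, i))
            inv[i][j] = -s / a[i][i]
    return inv


def trtri_lower_in_place(a):
    """Overwrite the lower triangle of `a` with its inverse (LAPACK style)."""
    inv = trtri_lower_out_of_place(a)  # scratch buffer plays the role of invA
    for i in range(len(a)):
        a[i][: i + 1] = inv[i][: i + 1]  # leave the upper triangle untouched


if __name__ == "__main__":
    m = [[2.0, 9.0], [1.0, 4.0]]  # the 9.0 in the upper triangle is never referenced
    trtri_lower_in_place(m)
    print(m)
```

The cost of this approach is an extra n-by-n workspace and a copy, which a caller such as getri would have to budget for.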

Undefined references 'rocblas_gemm_ex' and 'rocblas_gemm_strided_batched_ex' when trying to link with MIOpenDriver

What is the expected behavior

'rocblas_gemm_ex' and 'rocblas_gemm_strided_batched_ex' being defined in librocblas.so

What actually happens

'rocblas_gemm_ex' and 'rocblas_gemm_strided_batched_ex' are undefined in librocblas.so

How to reproduce

Build rocBLAS with these build options:
https://aur.archlinux.org/cgit/aur.git/tree/PKGBUILD?h=rocblas&id=7c9965639ddb3c7c578bffac3e3b58317f0022ab

And try to build MIOpen with these build options:
https://pastebin.com/vWwBmyiV

It then fails at "Linking CXX executable ../bin/MIOpenDriver" with "/opt/rocm/rocblas/rocblas/lib/librocblas.so.0.1" given on the command line:

ld: ../lib/libMIOpen.so.1: undefined reference to `rocblas_gemm_ex'
ld: ../lib/libMIOpen.so.1: undefined reference to `rocblas_gemm_strided_batched_ex'

Environment

Hardware description
GPU AMD Radeon VII
CPU AMD Ryzen 9 3950x
Software version
OS Arch Linux
Upstream kernel driver 5.5.2-arch1-1
ROCR v3.0
HCC v3.0
Library v3.0

Is Tensile necessary for these functions to be compiled? When trying to configure with

-DBUILD_WITH_TENSILE=ON

the configuration step throws a couple of Python exceptions...

rocBLAS_sgemm -- CreateKernel(): Unable to create kernel

What is the expected behavior

  • Caffe runs without error

What actually happens

I1010 17:49:43.093858  1775 caffe.cpp:251] Starting Optimization
I1010 17:49:43.093892  1775 solver.cpp:279] Solving CIFAR10_quick
I1010 17:49:43.093900  1775 solver.cpp:280] Learning Rate Policy: fixed
I1010 17:49:43.094336  1775 solver.cpp:337] Iteration 0, Testing net (#0)

Backtrace:
0x00007fc35b16219d:	Kalmar::HSADevice::CreateKernel(char const*, Kalmar::KalmarQueue*) + 0x1a7d
0x00007fc373246909:	hc::completion_future hc::parallel_for_each<Cijk_Alik_Bljk_SB_MT016x016x16_GSU08_KLS_NLCA01_NLCB01_PGR1_PLR1_TT04_04_VW01_WG04_04_08_WGM08(float*, float const*, float const*, float, float, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, ihipStream_t*, unsigned int, ihipEvent_t**, ihipEvent_t**)::HIP_kernel_functor_name_begin_unnamed_HIP_kernel_functor_nam
0x00007fc37322f130:	Cijk_Alik_Bljk_SB_MT016x016x16_GSU08_KLS_NLCA01_NLCB01_PGR1_PLR1_TT04_04_VW01_WG04_04_08_WGM08(float*, float const*, float const*, float, float, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, ihipStream_t*, unsigned int, ihipEvent_t**, ihipEvent_t**) + 0x1e0
0x00007fc373197dd9:	rocblas_sgemm + 0x209
0x00007fc3793202ad:	hipblasSgemm + 0x3d
0x00000000006d35e6:	void caffe::caffe_gpu_gemm<float>(CBLAS_TRANSPOSE, CBLAS_TRANSPOSE, int, int, int, float, float const*, float const*, float, float*) + 0x96
0x0000000000667cd5:	caffe::InnerProductLayer<float>::Forward_gpu(std::vector<caffe::Blob<float>*, std::allocator<caffe::Blob<float>*> > const&, std::vector<caffe::Blob<float>*, std::allocator<caffe::Blob<float>*> > const&) + 0x105
0x0000000000438727:	caffe::Layer<float>::Forward(std::vector<caffe::Blob<float>*, std::allocator<caffe::Blob<float>*> > const&, std::vector<caffe::Blob<float>*, std::allocator<caffe::Blob<float>*> > const&) + 0x2f7
0x0000000000708647:	caffe::Net<float>::ForwardFromTo(int, int) + 0xb7
0x0000000000708570:	caffe::Net<float>::Forward(float*) + 0x20
0x000000000080cb4a:	caffe::Solver<float>::Test(int) + 0x21a
0x000000000080b7d8:	caffe::Solver<float>::Step(int) + 0x3d8
0x000000000080b043:	caffe::Solver<float>::Solve(char const*) + 0x153
0x0000000000435903:	train() + 0x853
0x0000000000439022:	main + 0x1b2
0x00007fc3742cf830:	__libc_start_main + 0xf0
0x0000000000859af9:	_start + 0x29

HSADevice::CreateKernel(): Unable to create kernel Cijk_Alik_Bljk_SB_MT016x016x16_GSU08_KLS_NLCA01_NLCB01_PGR1_PLR1_TT04_04_VW01_WG04_04_08_WGM08 
  CreateKernel_raw=  _ZZ94Cijk_Alik_Bljk_SB_MT016x016x16_GSU08_KLS_NLCA01_NLCB01_PGR1_PLR1_TT04_04_VW01_WG04_04_08_WGM08PfPKfS1_ffjjjjjjjjjjjjjP12ihipStream_tjPP11ihipEvent_tS6_EN67HIP_kernel_functor_name_begin_unnamed_HIP_kernel_functor_name_end_819__cxxamp_trampolineES_jjjjjj
  CreateKernel_demangled=  Cijk_Alik_Bljk_SB_MT016x016x16_GSU08_KLS_NLCA01_NLCB01_PGR1_PLR1_TT04_04_VW01_WG04_04_08_WGM08(float*, float const*, float const*, float, float, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, ihipStream_t*, unsigned int, ihipEvent_t**, ihipEvent_t**)::HIP_kernel_functor_name_begin_unnamed_HIP_kernel_functor_name_end_8::__cxxamp_trampoline(float*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int)
*** Aborted at 1507657783 (unix time) try "date -d @1507657783" if you are using GNU date ***
PC: @     0x7fc3742e4428 gsignal
*** SIGABRT (@0x6ef) received by PID 1775 (TID 0x7fc37b42ab80) from PID 1775; stack trace: ***
    @     0x7fc37a52e390 (unknown)
    @     0x7fc3742e4428 gsignal
    @     0x7fc3742e602a abort
    @     0x7fc35b1622ce Kalmar::HSADevice::CreateKernel()
    @     0x7fc373246909 hc::parallel_for_each<>()
    @     0x7fc37322f130 Cijk_Alik_Bljk_SB_MT016x016x16_GSU08_KLS_NLCA01_NLCB01_PGR1_PLR1_TT04_04_VW01_WG04_04_08_WGM08()
    @     0x7fc373197dd9 rocblas_sgemm
    @     0x7fc3793202ad hipblasSgemm
    @           0x6d35e6 caffe::caffe_gpu_gemm<>()
    @           0x667cd5 caffe::InnerProductLayer<>::Forward_gpu()
    @           0x438727 caffe::Layer<>::Forward()
    @           0x708647 caffe::Net<>::ForwardFromTo()
    @           0x708570 caffe::Net<>::Forward()
    @           0x80cb4a caffe::Solver<>::Test()
    @           0x80b7d8 caffe::Solver<>::Step()
    @           0x80b043 caffe::Solver<>::Solve()
    @           0x435903 train()
    @           0x439022 main
    @     0x7fc3742cf830 __libc_start_main
    @           0x859af9 _start
    @                0x0 (unknown)
Aborted (core dumped)

How to reproduce

In the most recently built hipCaffe image, run this:

./data/cifar10/get_cifar10.sh
./examples/cifar10/create_cifar10.sh
time ./build/tools/caffe train --solver=examples/cifar10/cifar10_quick_solver.prototxt

Environment

Hardware description
GPU gfx900
CPU
Software version
ROCK 4.11.0-kfd-compute-rocm-rel-1.6-175
ROCR 1.1.6-83-g8b72125
HCC 1.0.17412
Library 0.7.0.0

Fix build when python3 is default

What is the expected behavior

You should be able to build the library regardless of whether python3 is the default Python on your system.

What actually happens

The scripts use a very simplistic check ("if /usr/bin/python exists, use that"), but that path is very likely
a symlink to either python3 or python2. If it is python3, the build fails.

How to reproduce

Have your /usr/bin/python be a symlink pointing to python3 instead of python2.

Environment

Observed on Arch Linux, where python3 is the default.

Fix suggestion and discussion

I initially surveyed three possible options, but on closer inspection the problem turned out to be simpler than I first thought, so the simplest option is sufficient. I'm making the fork and the PR.

TL;DR: I want to have this fixed and don't mind doing it, but I need to know which solution you prefer.

I came to this issue through a current effort to adapt the scripts in Experimental ROC to Arch Linux, since I didn't have much luck with the AUR packages. I'm picking up small details here and there in the build scripts of the individual packages and am willing to help, especially when, as in this case, it's an easy-looking tweak.
The problem seems to be located in cmake/virtualenv.cmake. There actually is a TODO comment related to this, along with a commented-out call to the CMake module FindPythonInterp.
That module seems unnecessary, since a quick fix is to look for python2 first and fall back to python only if python2 is not found:

find_program(VIRTUALENV_PYTHON_EXE python2)
if(NOT VIRTUALENV_PYTHON_EXE)
    find_program(VIRTUALENV_PYTHON_EXE python)
endif()
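The same lookup order can be sketched outside CMake for a quick check of what a given machine would pick. This is only an illustration of the preference logic using the standard library's `shutil.which`; the function name `find_virtualenv_python` is hypothetical and is not part of the rocBLAS build scripts.

```python
import shutil


def find_virtualenv_python(which=shutil.which):
    """Prefer python2; fall back to plain python only when python2 is absent."""
    return which("python2") or which("python")


if __name__ == "__main__":
    # Prints the interpreter path that would be chosen on this machine, or None.
    print(find_virtualenv_python())
```

Passing a custom `which` callable makes the ordering easy to exercise without touching the real PATH.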

No complex double support

What is the expected behavior

  • Export complex double functions

What actually happens

  • No complex double support in v1.9

rocBLAS might select underperforming BLAS kernels

I found the problem when comparing an MI50 on ROCm 2.5 with a Radeon VII on ROCm 2.4.
The Radeon VII performed much better under ROCm 2.4.
Using the ROCm profiler, I found that the two GPUs were calling different BLAS kernels.
The Radeon VII was calling a much faster one (7x faster).
After upgrading to ROCm 2.5, the Radeon VII started calling the slower kernel.

How to reproduce

  • Use the PyTorch-provided example with batch_size=256 (line 33). The batch size matters: the issue disappears with batch_size=2048.
  • python main.py

The script makes two calls to rocBLAS. The first call is the offending one.

  • ./rocblas-bench -f gemm -r f32_r --transposeA T --transposeB N -m 256 -n 1 -k 4 --alpha 1 --lda 4 --ldb 4 --beta 1 --ldc 256
  • ./rocblas-bench -f gemm -r f32_r --transposeA N --transposeB N -m 4 -n 1 -k 256 --alpha 1 --lda 4 --ldb 256 --beta 0 --ldc 4

On a Radeon VII with ROCm 2.4, the first call translates to

  • Cijk_Ailk_Bljk_SB_MT16x16x16_SE_APM1_AF0EM1_AF1EM1_AMAS1_ASEM1_BL1_DTL0_EPS1_FL0_GRVW2_GSU4_ISA906_IU1_K1_KLA_LPA0_LPB0_LDL1_MGWVW1_NLCA1_NLCB1_PK0_PGR1_PLR1_RK0_SU32_SNLL0_TT2_2_USFGRO1_VAW1_VW2_WG8_8_4_WGM1
  • NB: ISA906

On an MI50 with ROCm 2.5, it translates to

  • Cijk_Ailk_Bljk_SB_MT32x32x2_SE_APM1_AF0EM1_AF1EM1_AMAS2_ASEM1_BL0_DTL0_EPS0_FL0_GRVW4_GSU1_ISA000_IU1_K1_KLS_LPA0_LPB0_LDL1_MGWVW1_NLCA1_NLCB1_PK0_PGR1_PLR1_RK0_SU32_SNLL0_TT4_4_USFGRO0_VAW1_VW4_WG8_8_1_WGM1
  • NB: ISA000
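To compare many kernel names from a profiler dump, the ISA field can be pulled out with a small regular expression. The sketch below uses the two kernel names quoted above; `isa_field` is a hypothetical helper. Reading ISA000 as a generic kernel that is not specific to gfx906 is an assumption, not something the issue confirms.

```python
import re

FAST_KERNEL = (
    "Cijk_Ailk_Bljk_SB_MT16x16x16_SE_APM1_AF0EM1_AF1EM1_AMAS1_ASEM1_BL1_DTL0_"
    "EPS1_FL0_GRVW2_GSU4_ISA906_IU1_K1_KLA_LPA0_LPB0_LDL1_MGWVW1_NLCA1_NLCB1_"
    "PK0_PGR1_PLR1_RK0_SU32_SNLL0_TT2_2_USFGRO1_VAW1_VW2_WG8_8_4_WGM1"
)
SLOW_KERNEL = (
    "Cijk_Ailk_Bljk_SB_MT32x32x2_SE_APM1_AF0EM1_AF1EM1_AMAS2_ASEM1_BL0_DTL0_"
    "EPS0_FL0_GRVW4_GSU1_ISA000_IU1_K1_KLS_LPA0_LPB0_LDL1_MGWVW1_NLCA1_NLCB1_"
    "PK0_PGR1_PLR1_RK0_SU32_SNLL0_TT4_4_USFGRO0_VAW1_VW4_WG8_8_1_WGM1"
)


def isa_field(kernel_name):
    """Return the digits after '_ISA' in a Tensile kernel name, or None."""
    match = re.search(r"_ISA(\d+)_", kernel_name)
    return match.group(1) if match else None


if __name__ == "__main__":
    for name in (FAST_KERNEL, SLOW_KERNEL):
        print(isa_field(name), name[:40] + "...")
```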

Unfortunately, I was not able to compile rocblas-bench, but using the profiler I found
the performance values below for the offending kernel.

GPU                  DurationNs  rocBLAS Version
MI50 - gfx906        59363.84    2.2.6 d8417e9-dirty
Radeon VII - gfx906  8589.28     2.2.5 49ec204-dirty
Radeon VII - gfx906  57458.24    2.2.6 d8417e9-dirty
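As a quick sanity check of the "7x" figure, the slowdown relative to the fast Radeon VII run can be computed directly from the durations above. The dictionary keys are just labels chosen for this sketch.

```python
# Durations in nanoseconds, copied from the profiler table above.
durations_ns = {
    ("MI50", "2.2.6"): 59363.84,
    ("Radeon VII", "2.2.5"): 8589.28,
    ("Radeon VII", "2.2.6"): 57458.24,
}

baseline = durations_ns[("Radeon VII", "2.2.5")]
slowdown = {key: value / baseline for key, value in durations_ns.items()}

if __name__ == "__main__":
    for (gpu, version), factor in slowdown.items():
        print(f"{gpu} rocBLAS {version}: {factor:.2f}x the baseline duration")
```

Both 2.2.6 runs come out at a little under 7 times the 2.2.5 duration, which is consistent with the rounded figure quoted in the report.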

More details here
merged_rocprof.xlsx

Environment

MI50 Machine

Package           OS      OS_version  package_version   arch
rocblas           Ubuntu  16.04,now   2.2.6.0           amd64
rock-dkms         Ubuntu  16.04,now   2.5-27            all
rocm-clang-ocl    Ubuntu  16.04,now   0.4.0-7ce124f     amd64
rocm-dev          Ubuntu  16.04,now   2.5.27            amd64
rocm-device-libs  Ubuntu  16.04,now   0.0.1             amd64
rocm-dkms         Ubuntu  16.04,now   2.5.27            amd64
rocm-libs         Ubuntu  16.04,now   2.5.27            amd64
rocm-opencl       Ubuntu  16.04,now   1.2.0-2019060327  amd64
rocm-opencl-dev   Ubuntu  16.04,now   1.2.0-2019060327  amd64
rocm-utils        Ubuntu  16.04,now   2.5.27            amd64

 

[Questions] 1. Enable log traces in kernel and API 2. Google Test reports 0 test cases run

1. How do I enable traces (printf) or log traces in the kernel (for example, inside gemvc_kernel()) and in the routine API (rocblas_status rocblas_gemv_impl())?

Example 1:
rocblas_status rocblas_gemv_impl() --> printf enabled
rocblas_status gemvc_kernel() --> printf enabled

[Run binary from build directory] ./example-sgemm

The enabled printf traces are not displayed.
Printf may still be under development in ROCm:
https://rocm-documentation.readthedocs.io/en/latest/Programming_Guides/HIP-GUIDE.html#printf

Example 2:
export PATH=$PATH:/opt/rocm/bin
export LD_LIBRARY_PATH=/opt/rocm/rocblas/lib:$LD_LIBRARY_PATH
export ROCBLAS_LAYER=1
export ROCBLAS_LOG_TRACE_PATH=/home/user/rocBLAS/log_trace
[Run binary from build directory] ./example-sgemm or ./rocblas-test
Traces are not displayed, and no trace log output appears in the log_trace directory.

2. Google Test:
    Running Google Test reports "Running 0 tests from 0 test cases". Please let me know the correct command.

Example 1:
./rocblas-test --gtest_filter=checkingemmfloat-batched:NaN
rocBLAS version: 2.12.1.1749-rocm-rel-3.0-6-ca5535b

Query device success: there are 2 devices

Device ID 0 : Vega 20

Note: Google Test filter = checkingemmfloat-batched:NaN
[==========] Running 0 tests from 0 test cases.
[==========] 0 tests from 0 test cases ran. (56 ms total)
[ PASSED ] 0 tests.

Also, I would like to know whether arguments can be passed through gtest_filter in a way similar to ./rocblas-bench -f gemm -r s -m 1024 -n 1024 -k 1024 --transposeB T -v 1
Please help me with the questions above.
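One plausible reading of the "Running 0 tests" output follows from Google Test's documented filter syntax: patterns are separated by ':', the first '-' starts the list of negative patterns, and each pattern must match the full test name unless it uses the wildcards '*' or '?'. The sketch below re-implements that matching rule for illustration only. It is not Google Test's code, and the test name used in the demo is hypothetical rather than a real rocblas-test name.

```python
import re


def _pattern_to_regex(pattern):
    """Translate a gtest-style glob ('*' and '?' only) into a regex string."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


def split_filter(gtest_filter):
    """Split a filter into (positive patterns, negative patterns)."""
    positive, _, negative = gtest_filter.partition("-")
    pos = [p for p in positive.split(":") if p] or ["*"]
    neg = [p for p in negative.split(":") if p]
    return pos, neg


def matches(test_name, gtest_filter):
    """True if the full test name matches a positive and no negative pattern."""
    pos, neg = split_filter(gtest_filter)
    hit = any(re.fullmatch(_pattern_to_regex(p), test_name) for p in pos)
    excluded = any(re.fullmatch(_pattern_to_regex(p), test_name) for p in neg)
    return hit and not excluded


if __name__ == "__main__":
    name = "suite/gemm_batched.float_NaN"  # hypothetical test name
    print(split_filter("checkingemmfloat-batched:NaN"))
    print(matches(name, "checkingemmfloat-batched:NaN"))
    print(matches(name, "*gemm_batched*"))
```

Under these rules the filter in the report is read as one positive pattern, checkingemmfloat, with no wildcards, so it can only select a test whose full name is exactly that string. Running the binary with --gtest_list_tests shows the real names to build a wildcard filter from.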

rocblas_dgemm symbol is missing from librocblas.so

What is the expected behavior

  • rocblas_dgemm symbol is implemented in librocblas.so

What actually happens

  • rocblas_dgemm symbol is missing from librocblas.so, so libhipblas.so fails to load:
    /usr/lib64/libhipblas.so: error: symbol lookup error: undefined symbol: rocblas_dgemm (fatal)

How to reproduce

  • readelf -s librocblas.so | grep dgemm

Environment

Hardware description
GPU gfx900
CPU Ryzen CPU
Software version
ROCK v4.13-rocm1.7.0
ROCR v1.7
HCC v1.1.17503
Library v0.10.3.0

Cannot build rocBLAS with ROCm 2.10 on Ubuntu 18.04

What is the expected behavior

  • install.sh builds rocBLAS successfully

What actually happens

  • The build gets stuck partway through:
0         1
012345678901234
..########|||..
0         1
01234567890123456
..########|||..||
info: growing pool += 2 * 2 for GlobalWrite

0         1         2
012345678901234567890123456789
................#########||..|
0         1         2         3
0123456789012345678901234567890
................#########||..||
info: growing pool += 2 * 2 for GlobalWrite

0         1         2
012345678901234567890123456789
................#########||..|
0         1         2         3
0123456789012345678901234567890
................#########||..||
(gets stuck here)

How to reproduce

  • install ROCm 2.10
  • git clone rocBLAS 2.10 or 2.10.1 branch
  • ./install.sh

Environment

Hardware description
GPU vega20
CPU Intel i7-8700K 3.7GHz 12 CPUs
Software version
ROCK v2.10
ROCR v2.10
HCC v2.10
Library v2.10

Stack overflow in rocblas_*gemm_batched

I am attempting to convert Tensorflow to use rocblas_*gemm_batched API calls instead of rocblas_*gemm_strided_batched.

One of my tests fails with a segfault. I've traced it down to the following code

https://github.com/ROCmSoftwarePlatform/rocBLAS/blob/develop/library/src/blas3/Tensile/gemm.hpp#L542-L544

It fails because the test involves a batch count of 500000, which requires 500000*3*8 bytes = 12 MB of stack and exceeds the Linux default stack limit of 8 MB.
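The arithmetic can be checked in a few lines. The sketch assumes, following the figures in the report, three pointer arrays with one 8-byte pointer per batch entry, and takes the default stack limit as 8 MiB; the variable names are illustrative.

```python
batch_count = 500_000
pointer_arrays = 3   # assumption: one array of pointers each for A, B and C
pointer_bytes = 8    # 64-bit pointers

required_bytes = batch_count * pointer_arrays * pointer_bytes
default_stack_bytes = 8 * 1024 * 1024  # common Linux default (ulimit -s 8192)

if __name__ == "__main__":
    print(f"required: {required_bytes} bytes")
    print(f"limit:    {default_stack_bytes} bytes")
    print("overflows stack:", required_bytes > default_stack_bytes)
```

With these numbers, any batch count above roughly 349,525 would already exceed the limit, before counting the rest of the call stack.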

Additionally, this code seems to launch one kernel per batch entry, which is unacceptable performance-wise.

Doesn't compile with HIP from ROCm 1.4

What is the expected behavior

  • compiles with AMD's HIP

What actually happens

cd /home/a/rocBLAS/b/library-build/src && /opt/rocm/hip/bin/hipcc    -I/home/a/rocBLAS/library/include -I/home/a/rocBLAS/library/src/include -I/home/a/rocBLAS/b/library-build/include -I/home/a/rocBLAS/b/library-build/src  -m64 -g -fvisibility=hidden -fvisibility-inlines-hidden   -std=gnu++98 -o CMakeFiles/rocblas.dir/blas1/rocblas_axpy.cpp.o -c /home/a/rocBLAS/library/src/blas1/rocblas_axpy.cpp
/home/a/rocBLAS/library/src/blas1/rocblas_axpy.cpp:41:36: error: invalid operands to binary expression ('const float2' and 'const float2')
        y[tid * incy] +=  (*alpha) * (x[tid * incx]);
                          ~~~~~~~~ ^ ~~~~~~~~~~~~~~~
/home/a/rocBLAS/library/src/blas1/rocblas_axpy.cpp:110:41: note: in instantiation of function template specialization 'axpy_kernel_device_scalar<float2>' requested here
        hipLaunchKernel(HIP_KERNEL_NAME(axpy_kernel_device_scalar), dim3(blocks), dim3(threads), 0, rocblas_stream, n, alpha, x, incx, y, incy);
                                        ^

How to reproduce

  • install Ubuntu 16.04.1, update, install ROCm, git clone rocBLAS
  • cmake -DCMAKE_BUILD_TYPE=Debug -DBUILD_LIBRARY=ON -DBUILD_CLIENTS=ON -DBUILD_WITH_TENSILE=OFF -DHOST_TOOLCHAIN_NAME=gcc -DDEVICE_TOOLCHAIN_NAME=hipcc -DHIP_ROOT=/opt/rocm/hip
  • make fails

Environment

Hardware description
GPU Radeon R9 Nano
CPU Pentium G4400
Software version
ROCK 4.6.0-kfd-compute-rocm-rel-1.4-16
ROCR 1.4.0
HCC HCC clang version 3.5.0 (based on HCC 0.10.16501-81f0a2f-02246a0 LLVM 3.5.0svn)
HIP HIP version: 1.0.16503

(BTW, an almost identical problem exists with rocFFT.)

Building rocBLAS with hip-clang fails

I followed the instructions here to build HIP-clang, rocm-device-libs and HIP from the latest source. Then I tried to build rocBLAS with hip-clang (using install.sh as instructed here):

./install.sh --hip-clang

But the build failed with these errors:

...
[ 16%] Building CXX object library/src/CMakeFiles/Tensile.dir/__/__/Tensile/Kernels.cpp.o
[  3%] Building CXX object library/src/CMakeFiles/Tensile.dir/__/__/Tensile/SolutionHelper.cpp.o
[ 16%] Building CXX object library/src/CMakeFiles/Tensile.dir/__/__/Tensile/Tensile.cpp.o
[ 16%] Building CXX object library/src/CMakeFiles/Tensile.dir/__/__/Tensile/Tools.cpp.o
/data/zhangtong/workspace/rocBLAS/build/release/Tensile/Kernels.cpp:161:25: error: use of undeclared identifier 'hc_get_workitem_id'
  unsigned int serial = hc_get_workitem_id(0);
                        ^
/data/zhangtong/workspace/rocBLAS/build/release/Tensile/Kernels.cpp:221:23: error: use of undeclared identifier 'hc_get_group_id'; did you mean '__ockl_get_group_id'?
  unsigned int wg0I = hc_get_group_id(0);
                      ^~~~~~~~~~~~~~~
                      __ockl_get_group_id
/opt/rocm/include/hip/hcc_detail/hip_runtime.h:363:30: note: '__ockl_get_group_id' declared here
extern "C" __device__ size_t __ockl_get_group_id(uint);
                             ^
/data/zhangtong/workspace/rocBLAS/build/release/Tensile/Kernels.cpp:222:23: error: use of undeclared identifier 'hc_get_group_id'; did you mean '__ockl_get_group_id'?
  unsigned int wg1J = hc_get_group_id(1);
                      ^~~~~~~~~~~~~~~
                      __ockl_get_group_id
/opt/rocm/include/hip/hcc_detail/hip_runtime.h:363:30: note: '__ockl_get_group_id' declared here
extern "C" __device__ size_t __ockl_get_group_id(uint);
                             ^
/data/zhangtong/workspace/rocBLAS/build/release/Tensile/Kernels.cpp:223:24: error: use of undeclared identifier 'hc_get_num_groups'; did you mean '__ockl_get_num_groups'?
  unsigned int nwg0I = hc_get_num_groups(0);
                       ^~~~~~~~~~~~~~~~~
                       __ockl_get_num_groups
/opt/rocm/include/hip/hcc_detail/hip_runtime.h:373:30: note: '__ockl_get_num_groups' declared here
extern "C" __device__ size_t __ockl_get_num_groups(uint);
                             ^
/data/zhangtong/workspace/rocBLAS/build/release/Tensile/Kernels.cpp:224:24: error: use of undeclared identifier 'hc_get_num_groups'; did you mean '__ockl_get_num_groups'?
  unsigned int nwg1J = hc_get_num_groups(1);
                       ^~~~~~~~~~~~~~~~~
                       __ockl_get_num_groups
/opt/rocm/include/hip/hcc_detail/hip_runtime.h:373:30: note: '__ockl_get_num_groups' declared here
extern "C" __device__ size_t __ockl_get_num_groups(uint);
                             ^
/data/zhangtong/workspace/rocBLAS/build/release/Tensile/Kernels.cpp:278:24: error: use of undeclared identifier 'hc_get_group_id'; did you mean '__ockl_get_group_id'?
  unsigned int wgK = ( hc_get_group_id(2) ) % sizeK;
                       ^~~~~~~~~~~~~~~
                       __ockl_get_group_id
/opt/rocm/include/hip/hcc_detail/hip_runtime.h:363:30: note: '__ockl_get_group_id' declared here
extern "C" __device__ size_t __ockl_get_group_id(uint);
                             ^
/data/zhangtong/workspace/rocBLAS/build/release/Tensile/Kernels.cpp:1039:25: error: use of undeclared identifier 'hc_get_workitem_id'
  unsigned int serial = hc_get_workitem_id(0);
                        ^
/data/zhangtong/workspace/rocBLAS/build/release/Tensile/Kernels.cpp:1116:23: error: use of undeclared identifier 'hc_get_group_id'; did you mean '__ockl_get_group_id'?
  unsigned int wg0I = hc_get_group_id(0);
                      ^~~~~~~~~~~~~~~
                      __ockl_get_group_id
/opt/rocm/include/hip/hcc_detail/hip_runtime.h:363:30: note: '__ockl_get_group_id' declared here
extern "C" __device__ size_t __ockl_get_group_id(uint);
                             ^
/data/zhangtong/workspace/rocBLAS/build/release/Tensile/Kernels.cpp:1117:23: error: use of undeclared identifier 'hc_get_group_id'; did you mean '__ockl_get_group_id'?
  unsigned int wg1J = hc_get_group_id(1);
                      ^~~~~~~~~~~~~~~
                      __ockl_get_group_id
/opt/rocm/include/hip/hcc_detail/hip_runtime.h:363:30: note: '__ockl_get_group_id' declared here
extern "C" __device__ size_t __ockl_get_group_id(uint);
                             ^
/data/zhangtong/workspace/rocBLAS/build/release/Tensile/Kernels.cpp:1118:24: error: use of undeclared identifier 'hc_get_num_groups'; did you mean '__ockl_get_num_groups'?
  unsigned int nwg0I = hc_get_num_groups(0);
                       ^~~~~~~~~~~~~~~~~
                       __ockl_get_num_groups
/opt/rocm/include/hip/hcc_detail/hip_runtime.h:373:30: note: '__ockl_get_num_groups' declared here
extern "C" __device__ size_t __ockl_get_num_groups(uint);
                             ^
/data/zhangtong/workspace/rocBLAS/build/release/Tensile/Kernels.cpp:1119:24: error: use of undeclared identifier 'hc_get_num_groups'; did you mean '__ockl_get_num_groups'?
  unsigned int nwg1J = hc_get_num_groups(1);
                       ^~~~~~~~~~~~~~~~~
                       __ockl_get_num_groups
/opt/rocm/include/hip/hcc_detail/hip_runtime.h:373:30: note: '__ockl_get_num_groups' declared here
extern "C" __device__ size_t __ockl_get_num_groups(uint);
                             ^
/data/zhangtong/workspace/rocBLAS/build/release/Tensile/Kernels.cpp:1173:24: error: use of undeclared identifier 'hc_get_group_id'; did you mean '__ockl_get_group_id'?
  unsigned int wgK = ( hc_get_group_id(2) ) % sizeK;
                       ^~~~~~~~~~~~~~~
                       __ockl_get_group_id
/opt/rocm/include/hip/hcc_detail/hip_runtime.h:363:30: note: '__ockl_get_group_id' declared here
extern "C" __device__ size_t __ockl_get_group_id(uint);
                             ^
/data/zhangtong/workspace/rocBLAS/build/release/Tensile/Kernels.cpp:2402:25: error: use of undeclared identifier 'hc_get_workitem_id'
  unsigned int serial = hc_get_workitem_id(0);
                        ^
/data/zhangtong/workspace/rocBLAS/build/release/Tensile/Kernels.cpp:2469:23: error: use of undeclared identifier 'hc_get_group_id'; did you mean '__ockl_get_group_id'?
  unsigned int wg0I = hc_get_group_id(0);
                      ^~~~~~~~~~~~~~~
                      __ockl_get_group_id
/opt/rocm/include/hip/hcc_detail/hip_runtime.h:363:30: note: '__ockl_get_group_id' declared here
extern "C" __device__ size_t __ockl_get_group_id(uint);
                             ^
/data/zhangtong/workspace/rocBLAS/build/release/Tensile/Kernels.cpp:2470:23: error: use of undeclared identifier 'hc_get_group_id'; did you mean '__ockl_get_group_id'?
  unsigned int wg1J = hc_get_group_id(1);
                      ^~~~~~~~~~~~~~~
                      __ockl_get_group_id
/opt/rocm/include/hip/hcc_detail/hip_runtime.h:363:30: note: '__ockl_get_group_id' declared here
extern "C" __device__ size_t __ockl_get_group_id(uint);
                             ^
/data/zhangtong/workspace/rocBLAS/build/release/Tensile/Kernels.cpp:2471:24: error: use of undeclared identifier 'hc_get_num_groups'; did you mean '__ockl_get_num_groups'?
  unsigned int nwg0I = hc_get_num_groups(0);
                       ^~~~~~~~~~~~~~~~~
                       __ockl_get_num_groups
/opt/rocm/include/hip/hcc_detail/hip_runtime.h:373:30: note: '__ockl_get_num_groups' declared here
extern "C" __device__ size_t __ockl_get_num_groups(uint);
                             ^
/data/zhangtong/workspace/rocBLAS/build/release/Tensile/Kernels.cpp:2472:24: error: use of undeclared identifier 'hc_get_num_groups'; did you mean '__ockl_get_num_groups'?
  unsigned int nwg1J = hc_get_num_groups(1);
                       ^~~~~~~~~~~~~~~~~
                       __ockl_get_num_groups
/opt/rocm/include/hip/hcc_detail/hip_runtime.h:373:30: note: '__ockl_get_num_groups' declared here
extern "C" __device__ size_t __ockl_get_num_groups(uint);
                             ^
/data/zhangtong/workspace/rocBLAS/build/release/Tensile/Kernels.cpp:2526:24: error: use of undeclared identifier 'hc_get_group_id'; did you mean '__ockl_get_group_id'?
  unsigned int wgK = ( hc_get_group_id(2) ) % sizeK;
                       ^~~~~~~~~~~~~~~
                       __ockl_get_group_id
/opt/rocm/include/hip/hcc_detail/hip_runtime.h:363:30: note: '__ockl_get_group_id' declared here
extern "C" __device__ size_t __ockl_get_group_id(uint);
                             ^
/data/zhangtong/workspace/rocBLAS/build/release/Tensile/Kernels.cpp:3716:25: error: use of undeclared identifier 'hc_get_workitem_id'
  unsigned int serial = hc_get_workitem_id(0);
                        ^
fatal error: too many errors emitted, stopping now [-ferror-limit=]

Please help. Thanks in advance.

Build failed with ROCm 2.5 & 2.3

What is the expected behavior

What actually happens

make install errors out; this may be related to ROCm/HIP#485.

How to reproduce

Use the rocm-2.3 repo on ROCm 2.3, then run:

make
make install

make install then errors out.

Environment

| Hardware | description |
| GPU | amd vega |
| CPU | not specified |

| Software | version |
| ROCm | 2.3 |

Please reopen this issue.
Environment: ROCm 2.3, Ubuntu 18.04

I hit an error when building rocBLAS:

~/ROCm/rocBLAS/release# make install
Scanning dependencies of target Tensile
[ 3%] Building CXX object library/src/CMakeFiles/Tensile.dir///Tensile/Kernels.cpp.o
Stack dump:
0. Program arguments: /opt/rocm/hcc/bin/llc -mtriple amdgcn-amd-amdhsa -mcpu=gfx900 -mattr=-code-object-v3 -O2 -amdgpu-function-calls=0 -filetype=obj -o /tmp/tmp.r7iSdMgsmb/kernel-gfx900.hsaco.isabin /tmp/tmp.r7iSdMgsmb/kernel-gfx900.hsaco.opt.bc

Running pass 'CallGraph Pass Manager' on module '/tmp/tmp.r7iSdMgsmb/kernel-gfx900.hsaco.opt.bc'.
Running pass 'Prologue/Epilogue Insertion & Frame Finalization' on function '@Cijk_Ailk_Bjlk_4xi8BH_MT96x128x16_APM1_AF0EM1_AF1EM1_AMAS2_ASEM1_BL0_DTL0_EPS0_FL0_GRVW2_GSU1_ISA000_IU1_K1_KLS_LPA0_LPB0_LDL1_MGWVW1_NLCA3_NLCB1_PK0_PGR1_PLR1_SU32_SNLL0_TT6_8_USFGRO0_VAW1_VW2_WG16_16_1_WGM8'
/opt/rocm/hcc/bin/llc(_ZN4llvm3sys15PrintStackTraceERNS_11raw_ostreamE+0x40)[0xaaaac00b1ee8]
/opt/rocm/hcc/bin/clamp-device: line 250: 54708 Segmentation fault (core dumped) $LLC -mtriple amdgcn-amd-amdhsa -mcpu=$AMDGPU_TARGET $CODE_OBJECT_FORMAT $KMOPTLLC -amdgpu-function-calls=$AMDGPU_FUNC_CALLS -filetype=obj -o $2.isabin $2.opt.bc
Generating AMD GCN kernel failed in llc for target: gfx900
Error: hc-kernel-assemble[175]: failed with status -1
clang-9: error: HC assembler command failed with exit code 255 (use -v to see invocation)
library/src/CMakeFiles/Tensile.dir/build.make:62: recipe for target 'library/src/CMakeFiles/Tensile.dir///Tensile/Kernels.cpp.o' failed
make[2]: *** [library/src/CMakeFiles/Tensile.dir///Tensile/Kernels.cpp.o] Error 255
CMakeFiles/Makefile2:142: recipe for target 'library/src/CMakeFiles/Tensile.dir/all' failed
make[1]: *** [library/src/CMakeFiles/Tensile.dir/all] Error 2
Makefile:151: recipe for target 'all' failed
make: *** [all] Error 2

Build on internal version of ROCm?

Hi,

Our team is trying to build rocBLAS on machines that have an internal ROCm version installed. Because the internal ROCm version is not signed, the current rocBLAS install.sh exits non-zero at 'sudo apt update'. Is it possible to build rocBLAS against the latest ROCm bits?

Thanks,
Qiyu

Build is stuck on Tensile Create Library

What is the expected behavior

What actually happens

  • build gets stuck on ################################################################################# Tensile Create Library
  • no process activity is shown after this message appears

How to reproduce

  • yay -Sa miopen

Environment

| Hardware | description |
| GPU | gfx803;gfx900;gfx906 |
| CPU | Ryzen 2500U |

| Software | version |
| HIP | 2.0.0-3 |
| ROCR | 2.0.0-2 |
| HCC | 2.0.0-3 |

roc::rocblas target and C++ compilers

If you use the roc::rocblas target (defined in rocblas-config.cmake) in your CMake script to link to rocBLAS, you have to use hcc as your C++ compiler. I think this happens because hip::hip_hcc is marked PUBLIC, which means its dependencies are transitive and its compilation options are passed on through roc::rocblas (I understand that this is so the HIP directories are included). Unfortunately, some of those options work only with the hcc compiler.

I just want to know whether it is supposed to be that way (roc::rocblas works with hcc; if you want to use a different compiler, use rocblas_INCLUDE_DIRS, rocblas_LIB_INSTALL_DIR, and -lrocblas), or whether this is a work in progress.

rocblas_hgemm is not implemented

In this version we cannot test half-precision performance, because the function rocblas_hgemm has not been implemented.
How can we test half-precision performance on a Vega10 MI25?

CentOS 7 Build from source issues

Could you please provide instructions for building rocBLAS on a CentOS 7 machine?

Thanks,

/nfs/home/zjin/rocBLAS_new/build/release/virtualenv/pip install git+https://github.com/ROCmSoftwarePlatform/Tensile.git@362cf90c35914a2c05633a31a9cf33e573737f99
pip is configured with locations that require TLS/SSL, however the ssl module in Python is not available.
Collecting git+https://github.com/ROCmSoftwarePlatform/Tensile.git@362cf90c35914a2c05633a31a9cf33e573737f99
Cloning https://github.com/ROCmSoftwarePlatform/Tensile.git (to 362cf90c35914a2c05633a31a9cf33e573737f99) to /tmp/pip-i9fd3p41-build
Could not find a tag or branch '362cf90c35914a2c05633a31a9cf33e573737f99', assuming commit.
Collecting pyyaml (from Tensile==4.13.0)
Installing collected packages: pyyaml, Tensile
Running setup.py install for Tensile: started
Running setup.py install for Tensile: finished with status 'done'
Successfully installed Tensile-4.13.0 pyyaml-5.1.2
-- using GIT Tensile fork=ROCmSoftwarePlatform from branch=fe4f721886d07eef6251cea4225e027181022aa5
-- The C compiler identification is GNU 4.8.5
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Found PkgConfig: /usr/bin/pkg-config (found version "0.27.1")
-- Found OpenMP_C: -fopenmp (found version "3.1")
CMake Error at /usr/share/cmake3/Modules/FindOpenMP.cmake:328 (try_compile):
Cannot copy output executable

''

to destination specified by COPY_FILE:

'/nfs/home/zjin/rocBLAS/build/release/CMakeFiles/FindOpenMP/ompver_CXX.bin'

Unable to find the executable at any of:

/nfs/home/zjin/rocBLAS/build/release/CMakeFiles/CMakeTmp/cmTC_dd937
/nfs/home/zjin/rocBLAS/build/release/CMakeFiles/CMakeTmp/Debug/cmTC_dd937
/nfs/home/zjin/rocBLAS/build/release/CMakeFiles/CMakeTmp/Development/cmTC_dd937

Call Stack (most recent call first):
/usr/share/cmake3/Modules/FindOpenMP.cmake:453 (_OPENMP_GET_SPEC_DATE)
build/release/virtualenv/lib/python3.7/site-packages/Tensile/Source/CMakeLists.txt:44 (find_package)

-- Found OpenMP_CXX: -fopenmp=libomp
-- Found OpenMP: TRUE (found version "3.1")
CMake Error at build/release/virtualenv/lib/python3.7/site-packages/Tensile/Source/lib/CMakeLists.txt:48 (find_package):
Could not find a package configuration file provided by "LLVM" (requested
version 7.0
) with any of the following names:

LLVMConfig.cmake
llvm-config.cmake

I found that the Tensile library on the master branch has removed the LLVM 7.0 requirement. After manually removing the 7.0 requirement and using LLVM 10.0, I saw many errors about invalid instructions for the Vega20 GPU (ISA = gfx906) while building the library:

Warning: batch index [2,0] should be in SetConstStrideB
Warning: batch index [2,0] should be in SetConstStrideB
Warning: batch index [2,0] should be in SetConstStrideB
Reading logic files: Done.
[|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||] 100% (122.9 secs elapsed)

Generating kernels: Launching 16 threads...
warning: ISA: (9, 0, 8) is not supported; overriding with (9, 0, 0)
warning: ISA: (9, 0, 8) is not supported; overriding with (9, 0, 0)

L0_EPS1_FL0_GRVW1_GSU1_ISA906_IU1_K1_KLA_LPA0_LPB1_LDL1_NLCA1_NLCB1_PBD1_PK0_PGR1_PLR1_RK0_SU32_SNLL0_TT4_6_USFGRO0_VAW1_VW1_WG16_8_1_WGM64.s']' returned non-zero exit statu s 1.
/nfs/home/zjin/rocBLAS_new/build/release/assembly/Cijk_Ailk_Bljk_SB_MT192x32x8_SE_APM1_AF0EM1_AF1EM1_AMAS0_ASEM1_BL1_DTL0_EPS1_FL0_GRVW1_GSU1_ISA906_IU1_K1_KLA_LPA0_LPB1_LDL 1_NLCA3_NLCB1_PBD1_PK0_PGR1_PLR1_RK0_SU32_SNLL0_TT6_4_USFGRO0_VAW1_VW1_WG32_8_1_WGM64.s:2095:24: error: invalid operand for instruction
v_mul_lo_u32 v50, v41, constStrideC0I // addrCalc <- scaled extracted dim
^
/nfs/home/zjin/rocBLAS_new/build/release/assembly/Cijk_Ailk_Bljk_SB_MT192x32x8_SE_APM1_AF0EM1_AF1EM1_AMAS0_ASEM1_BL1_DTL0_EPS1_FL0_GRVW1_GSU1_ISA906_IU1_K1_KLA_LPA0_LPB1_LDL 1_NLCA3_NLCB1_PBD1_PK0_PGR1_PLR1_RK0_SU32_SNLL0_TT6_4_USFGRO0_VAW1_VW1_WG32_8_1_WGM64.s:2108:24: error: invalid operand for instruction
v_mul_lo_u32 v51, v41, constStrideC0I // addrCalc <- scaled extracted dim
^
/nfs/home/zjin/rocBLAS_new/build/release/assembly/Cijk_Ailk_Bljk_SB_MT192x32x8_SE_APM1_AF0EM1_AF1EM1_AMAS0_ASEM1_BL1_DTL0_EPS1_FL0_GRVW1_GSU1_ISA906_IU1_K1_KLA_LPA0_LPB1_LDL 1_NLCA3_NLCB1_PBD1_PK0_PGR1_PLR1_RK0_SU32_SNLL0_TT6_4_USFGRO0_VAW1_VW1_WG32_8_1_WGM64.s:2121:24: error: invalid operand for instruction
v_mul_lo_u32 v52, v41, constStrideC0I // addrCalc <- scaled extracted dim
^
/nfs/home/zjin/rocBLAS_new/build/release/assembly/Cijk_Ailk_Bljk_SB_MT192x32x8_SE_APM1_AF0EM1_AF1EM1_AMAS0_ASEM1_BL1_DTL0_EPS1_FL0_GRVW1_GSU1_ISA906_IU1_K1_KLA_LPA0_LPB1_LDL 1_NLCA3_NLCB1_PBD1_PK0_PGR1_PLR1_RK0_SU32_SNLL0_TT6_4_USFGRO0_VAW1_VW1_WG32_8_1_WGM64.s:2135:24: error: invalid operand for instruction
v_mul_lo_u32 v53, v41, constStrideC0I // addrCalc <- scaled extracted dim
^
/nfs/home/zjin/rocBLAS_new/build/release/assembly/Cijk_Ailk_Bljk_SB_MT192x32x8_SE_APM1_AF0EM1_AF1EM1_AMAS0_ASEM1_BL1_DTL0_EPS1_FL0_GRVW1_GSU1_ISA906_IU1_K1_KLA_LPA0_LPB1_LDL 1_NLCA3_NLCB1_PBD1_PK0_PGR1_PLR1_RK0_SU32_SNLL0_TT6_4_USFGRO0_VAW1_VW1_WG32_8_1_WGM64.s:2149:24: error: invalid operand for instruction
v_mul_lo_u32 v54, v41, constStrideC0I // addrCalc <- scaled extracted dim
^
/nfs/home/zjin/rocBLAS_new/build/release/assembly/Cijk_Ailk_Bljk_SB_MT192x32x8_SE_APM1_AF0EM1_AF1EM1_AMAS0_ASEM1_BL1_DTL0_EPS1_FL0_GRVW1_GSU1_ISA906_IU1_K1_KLA_LPA0_LPB1_LDL 1_NLCA3_NLCB1_PBD1_PK0_PGR1_PLR1_RK0_SU32_SNLL0_TT6_4_USFGRO0_VAW1_VW1_WG32_8_1_WGM64.s:2163:24: error: invalid operand for instruction
v_mul_lo_u32 v55, v41, constStrideC0I // addrCalc <- scaled extracted dim
^
/nfs/home/zjin/rocBLAS_new/build/release/assembly/Cijk_Ailk_Bljk_SB_MT192x32x8_SE_APM1_AF0EM1_AF1EM1_AMAS0_ASEM1_BL1_DTL0_EPS1_FL0_GRVW1_GSU1_ISA906_IU1_K1_KLA_LPA0_LPB1_LDL 1_NLCA3_NLCB1_PBD1_PK0_PGR1_PLR1_RK0_SU32_SNLL0_TT6_4_USFGRO0_VAW1_VW1_WG32_8_1_WGM64.s:2728:24: error: invalid operand for instruction
v_mul_lo_u32 v30, v29, constStrideC0I // addrCalc <- scaled extracted dim
^
/nfs/home/zjin/rocBLAS_new/build/release/assembly/Cijk_Ailk_Bljk_SB_MT192x32x8_SE_APM1_AF0EM1_AF1EM1_AMAS0_ASEM1_BL1_DTL0_EPS1_FL0_GRVW1_GSU1_ISA906_IU1_K1_KLA_LPA0_LPB1_LDL 1_NLCA3_NLCB1_PBD1_PK0_PGR1_PLR1_RK0_SU32_SNLL0_TT6_4_USFGRO0_VAW1_VW1_WG32_8_1_WGM64.s:2738:24: error: invalid operand for instruction
v_mul_lo_u32 v31, v29, constStrideC0I // addrCalc <- scaled extracted dim
^
/nfs/home/zjin/rocBLAS_new/build/release/assembly/Cijk_Ailk_Bljk_SB_MT192x32x8_SE_APM1_AF0EM1_AF1EM1_AMAS0_ASEM1_BL1_DTL0_EPS1_FL0_GRVW1_GSU1_ISA906_IU1_K1_KLA_LPA0_LPB1_LDL 1_NLCA3_NLCB1_PBD1_PK0_PGR1_PLR1_RK0_SU32_SNLL0_TT6_4_USFGRO0_VAW1_VW1_WG32_8_1_WGM64.s:2748:24: error: invalid operand for instruction
v_mul_lo_u32 v32, v29, constStrideC0I // addrCalc <- scaled extracted dim
^
/nfs/home/zjin/rocBLAS_new/build/release/assembly/Cijk_Ailk_Bljk_SB_MT192x32x8_SE_APM1_AF0EM1_AF1EM1_AMAS0_ASEM1_BL1_DTL0_EPS1_FL0_GRVW1_GSU1_ISA906_IU1_K1_KLA_LPA0_LPB1_LDL 1_NLCA3_NLCB1_PBD1_PK0_PGR1_PLR1_RK0_SU32_SNLL0_TT6_4_USFGRO0_VAW1_VW1_WG32_8_1_WGM64.s:2759:24: error: invalid operand for instruction
v_mul_lo_u32 v33, v29, constStrideC0I // addrCalc <- scaled extracted dim
^
/nfs/home/zjin/rocBLAS_new/build/release/assembly/Cijk_Ailk_Bljk_SB_MT192x32x8_SE_APM1_AF0EM1_AF1EM1_AMAS0_ASEM1_BL1_DTL0_EPS1_FL0_GRVW1_GSU1_ISA906_IU1_K1_KLA_LPA0_LPB1_LDL 1_NLCA3_NLCB1_PBD1_PK0_PGR1_PLR1_RK0_SU32_SNLL0_TT6_4_USFGRO0_VAW1_VW1_WG32_8_1_WGM64.s:2770:24: error: invalid operand for instruction
v_mul_lo_u32 v34, v29, constStrideC0I // addrCalc <- scaled extracted dim
^
/nfs/home/zjin/rocBLAS_new/build/release/assembly/Cijk_Ailk_Bljk_SB_MT192x32x8_SE_APM1_AF0EM1_AF1EM1_AMAS0_ASEM1_BL1_DTL0_EPS1_FL0_GRVW1_GSU1_ISA906_IU1_K1_KLA_LPA0_LPB1_LDL 1_NLCA3_NLCB1_PBD1_PK0_PGR1_PLR1_RK0_SU32_SNLL0_TT6_4_USFGRO0_VAW1_VW1_WG32_8_1_WGM64.s:2781:24: error: invalid operand for instruction
v_mul_lo_u32 v35, v29, constStrideC0I // addrCalc <- scaled extracted dim
^
/nfs/home/zjin/rocBLAS_new/build/release/assembly/Cijk_Ailk_Bljk_SB_MT192x32x8_SE_APM1_AF0EM1_AF1EM1_AMAS0_ASEM1_BL1_DTL0_EPS1_FL0_GRVW1_GSU1_ISA906_IU1_K1_KLA_LPA0_LPB1_LDL 1_NLCA3_NLCB1_PBD1_PK0_PGR1_PLR1_RK0_SU32_SNLL0_TT6_4_USFGRO0_VAW1_VW1_WG32_8_1_WGM64.s:2884:24: error: invalid operand for instruction
v_mul_lo_u32 v30, v29, constStrideC0I // addrCalc <- scaled extracted dim
^

rocBLAS cannot be used without a HIP compiler

What is the expected behavior

When using rocBLAS as a library, it should not be necessary to build the application using a HIP compiler.

What actually happens

Because rocBLAS uses HIP features in its public API, it's necessary to build with a compiler that supports HIP. This could be solved by wrapping, as part of rocBLAS's API, the parts of the HIP API that are necessary to use rocBLAS.

Installation (./install.sh -d) freezes rocm-terminal docker instance

What is the expected behavior

  • System remains stable and successfully completes compilation

What actually happens

  • System freezes before compilation successfully completes, requiring a hard power reset

How to reproduce

Environment

  • HP dv6 laptop
  • GPU - NVIDIA GeForce GT 650M
  • CPU - Intel Core i7-3630QM
  • OS - Ubuntu 18.04.2
  • Docker Container - rocm-terminal:2.5

Build rocblas-bench failed show fetching file fail

What is the expected behavior

  • rocblas-bench should build successfully, and rocblas should be found in the staging folder

What actually happens

  • The build fails with a file-fetch error, as shown in the attached screenshot (wechat screenshot_20181019143231)

  • Running 'sudo apt-get update' shows the same error, which seems to be caused by sources.list:
    Err:4 file:/tmp/rocm1.9 xenial/main amd64 Packages
    File not found - /tmp/rocm1.9/dists/xenial/main/binary-amd64/Packages (2: No such file or directory)
    Get:5 file:/tmp/rocm1.9 xenial/main Translation-en_US
    Ign:5 file:/tmp/rocm1.9 xenial/main Translation-en_US
    Hit:13 http://storage.googleapis.com/bazel-apt stable InRelease
    Hit:14 http://security.ubuntu.com/ubuntu bionic-security InRelease
    Hit:15 http://cn.archive.ubuntu.com/ubuntu bionic InRelease
    Hit:16 http://cn.archive.ubuntu.com/ubuntu bionic-updates InRelease
    Get:17 http://cn.archive.ubuntu.com/ubuntu bionic-backports InRelease [74.6 kB]
    Hit:18 https://download.docker.com/linux/ubuntu bionic InRelease
    Fetched 74.6 kB in 2s (46.1 kB/s)
    Reading package lists... Done
    N: Ignoring file 'rocm.listsudo' in directory '/etc/apt/sources.list.d/' as it has an invalid filename extension
    E: Failed to fetch file:/tmp/rocm1.9/dists/xenial/main/binary-amd64/Packages File not found - /tmp/rocm1.9/dists/xenial/main/binary-amd64/Packages (2: No such file or directory)

sources.txt

How to reproduce

  • Run the script below:

    BASE="rocBLAS"
    BRANCH="v14.3.0"   # alternatively "master"

    # Optional overrides: $1 = base directory name, $2 = branch or tag
    if [ -n "$1" ]; then
      BASE=$1
    fi

    if [ -n "$2" ]; then
      BRANCH=$2
    fi

    PULL_DIR="${BASE}-${BRANCH}"

    git clone -b "${BRANCH}" https://github.com/ROCmSoftwarePlatform/rocBLAS.git "${PULL_DIR}"
    cd "${PULL_DIR}"
    ./install.sh -dc 2>&1 | tee make.out

Environment

| Hardware | description |
| GPU | Vega10 |

Software versions (compute packages):

ii hcc 1.2.18354 amd64 HCC: An Open Source, Optimizing C++ Compiler for Heterogeneous Compute
ii hip_base 1.5.18353 amd64 HIP: Heterogenous-computing Interface for Portability [BASE]
ii hip_doc 1.5.18353 amd64 HIP: Heterogenous-computing Interface for Portability [DOCUMENTATION]
ii hip_hcc 1.5.18353 amd64 HIP: Heterogenous-computing Interface for Portability [HCC]
ii hip_samples 1.5.18353 amd64 HIP: Heterogenous-computing Interface for Portability [SAMPLES]
ii hsa-amd-aqlprofile 1.0.0 amd64 AQLPROFILE library for AMD HSA runtime API extension support
ii hsa-ext-rocr-dev 1.1.9-8-g51c00c2 amd64 AMD Heterogeneous System Architecture HSA - Linux HSA Runtime extensions for ROCm platforms
ii hsa-rocr-dev 1.1.9-8-g51c00c2 amd64 AMD Heterogeneous System Architecture HSA - Linux HSA Runtime for ROCm platforms
ii hsakmt-roct 1.0.9-8-g238782c amd64 HSAKMT library for AMD KFD support
ii hsakmt-roct-dev 1.0.9-8-g238782c amd64 HSAKMT development package.
ii rocm-clang-ocl 0.3.0-7997136 amd64 OpenCL compilation with clang compiler.
ii rocm-cmake 0.2.0-6240bb3 amd64 rocm-cmake built using CMake
ii rocm-dev 1.9.211 amd64 Radeon Open Compute (ROCm) Runtime software stack
ii rocm-device-libs 0.0.1 amd64 Radeon Open Compute - device libraries
ii rocm-dkms 1.9.211 amd64 Radeon Open Compute (ROCm) Runtime software stack
ii rocm-opencl 1.2.0-2018090737 amd64 OpenCL/ROCm
ii rocm-opencl-dev 1.2.0-2018090737 amd64 OpenCL/ROCm
ii rocm-smi 1.0.0-72-gec1da05 amd64 System Management Interface for ROCm
ii rocm-utils 1.9.211 amd64 Radeon Open Compute (ROCm) Runtime software stack
ii rocminfo 1.0.0 amd64 Radeon Open Compute (ROCm) Runtime rocminfo tool
ii rocr_debug_agent 1.0.0 amd64 Radeon Open Compute (ROCm) Runtime debug agent

hipblasSgemm()/rocblas_sgemm() is extremely slow on Vega 20

Vega 20 has 10% more TFLOPS than Vega 64, and I verified that it delivers this in most common cases, such as Conv2D from ResNet.

However, rocblas_sgemm() is unexpectedly slow: computing a 1024x1024 matrix multiplication takes over 300% of the time that the same computation takes on Vega 64.

So any suggestions or optimization plans?
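
When comparing GEMM timings across GPUs, it can help to convert each measured time into an achieved throughput. The sketch below assumes the conventional count of 2*M*N*K floating-point operations for a GEMM. The helper names gemm_gflops and slowdown are hypothetical and not part of rocBLAS, and the timings themselves have to be measured separately.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of rocBLAS): achieved GFLOP/s for a GEMM of
// shape M x N x K that took `seconds` to run. Assumes the conventional count
// of 2*M*N*K floating-point operations (one multiply and one add per term).
double gemm_gflops(std::int64_t m, std::int64_t n, std::int64_t k, double seconds)
{
    assert(m > 0 && n > 0 && k > 0 && seconds > 0.0);
    const double flops = 2.0 * static_cast<double>(m) * static_cast<double>(n)
                         * static_cast<double>(k);
    return flops / seconds / 1e9;
}

// Ratio of two timings for the same problem; a value above 1 means the
// first measurement took longer than the second.
double slowdown(double slow_seconds, double fast_seconds)
{
    assert(slow_seconds > 0.0 && fast_seconds > 0.0);
    return slow_seconds / fast_seconds;
}
```

For a 1024x1024x1024 problem that takes 1 ms, gemm_gflops(1024, 1024, 1024, 1e-3) comes to about 2147 GFLOP/s; a run that takes three times as long corresponds to roughly 716 GFLOP/s.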

compilation error with hip-clang

There is another compilation error for hip-clang:

  • [ 23%] Building CXX object library/src/CMakeFiles/rocblas.dir/blas2/rocblas_trsv.cpp.o

hipcc-cmd: /opt/rocm/llvm/bin/clang++ --hip-device-lib-path=/home/yaxunl/git/rocdl/rel/dist/lib -std=c++11 -isystem /opt/rocm/llvm/bin/../lib/clang/9.0.0/include -Xclang -fallow-half-arguments-and-returns -D__HIP_HCC_COMPAT_MODE__=1 -I/opt/rocm/hip/include -DHIP_VERSION_MAJOR=1 -DHIP_VERSION_MINOR=5 -DHIP_VERSION_PATCH=18391 --cuda-gpu-arch=gfx900 -D__HIP_ARCH_GFX900__=1 -O3 -fgpu-rdc -DBUILD_WITH_TENSILE=1 -DTensile_RUNTIME_LANGUAGE_HIP=1 -DTensile_RUNTIME_LANGUAGE_OCL=0 -Drocblas_EXPORTS -I/home/yaxunl/git/rocblas/rocBLAS/library/include -I/home/yaxunl/git/rocblas/rocBLAS/library/src/include -I/home/yaxunl/git/rocblas/rocBLAS/build/release/include -I/home/yaxunl/git/rocblas/rocBLAS/library/src/blas3/Tensile -I/home/yaxunl/git/rocblas/rocBLAS/build/release/Tensile -O3 -DNDEBUG -fPIC -fvisibility=hidden -fvisibility-inlines-hidden -std=c++14 -o CMakeFiles/rocblas.dir/blas2/rocblas_trsv.cpp.o -c -x hip /home/yaxunl/git/rocblas/rocBLAS/library/src/blas2/rocblas_trsv.cpp
clang-8: warning: argument unused during compilation: '--hip-device-lib-path=/home/yaxunl/git/rocdl/rel/dist/lib' [-Wunused-command-line-argument]
/home/yaxunl/git/rocblas/rocBLAS/library/src/blas2/rocblas_trsv.cpp:184:31: error: dynamic initialization is not supported for __device__, __constant__, and __shared__ variables.
static const __device__ T alpha_d{1};
^ ~~~
1 error generated when compiling for gfx900.

Supplying a wrapper for using rocBLAS with Python 3

It would be highly beneficial to end users if rocBLAS could be used from Python 3 through a wrapper built on PyOpenCL.

For now I am using CLBlast, as it has a Python 3 package, pyclblast, but for some reason its SGEMM performance was sub-optimal on Vega, where I obtained only ~3.5 TFLOPS at best. It might be better if the AMD-optimized rocBLAS could be used from Python.

Installing rocBLAS

What is the expected behavior

Complete configuration and build of rocBLAS

What actually happens

  1. Python 2.7 is not detected on a system where Python 3.x is the default.
  2. Header files (Tensile.h, etc.) are missing from "build/release/Tensile" during the build.

How to reproduce

Check out rocBLAS and follow these instructions: https://github.com/ROCmSoftwarePlatform/rocBLAS/wiki/1.Build#build-dependencies--library-using-individual-commands

Environment

| Hardware | description |
| GPU | Radeon RX 560 |
| CPU | Ryzen 7 1800X |

| Software | version |
| ROCK | v1.9 |
| ROCR | v1.9 |
| HCC | v1.9 |
| rocBLAS | development branch |

My solution:

  1. I installed "Tensile" but it is not detected. Is an environment variable needed?
    In the meantime I modified "cmake/virtualenv.cmake":
-# find_package(PythonInterp)
+find_package(PythonInterp 2.7 REQUIRED)
 # # TODO: Check PYTHON_VERSION_MAJOR
 
-find_program(VIRTUALENV_PYTHON_EXE python)
+find_program(VIRTUALENV_PYTHON_EXE python2.7)

  2. Add the path:
   set( Tensile_INC
     ${CMAKE_CURRENT_SOURCE_DIR}/blas3/Tensile
+    ${CMAKE_BINARY_DIR}/Tensile
   )
 endif( )

With these changes I can configure and build rocBLAS.

Build rocBLAS failed on nvcc platform

Hi, I followed https://github.com/ROCmSoftwarePlatform/rocBLAS/wiki/Setting-up-enviroment-on-NVIDIA-platform and successfully installed hip_nvcc in /opt/rocm/hip, but installing rocBLAS by following https://github.com/ROCmSoftwarePlatform/rocBLAS/wiki/Build-rocBLAS-libraries-and-verification-code failed as follows:

  1. The C++ compiler "/opt/rocm/hip/bin/hipcc" is not able to compile a simple
    test program.
  2. nvcc fatal : Unknown option '-amdgpu-target'

My OS is Ubuntu 14.04 server x86_64. When I configure rocBLAS with ccmake .., it shows EMPTY CACHE,
so I can only run cmake like this:
cmake .. -DBUILD_SHARED_LIBS=ON -DCMAKE_PREFIX_PATH=~/repos/rocBLAS/build/library-package
-DCPACK_PACKAGE_INSTALL_PREFIX=/opt/rocm/rocblas -DHIP_ROOT=/opt/rocm/hip
-DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DHOST_TOOLCHAIN_NAME=gcc

What is the problem here? I look forward to a reply.

about the vega20_Cijk_Alik_Bljk_DB.yaml files

I want to know the meaning and function of the data at the bottom of the file, such as the bottom of vega20_Cijk_Alik_Bljk_DB.yaml:

      • [1, 64, 1, 64]
      • [1, 0.428463]
      • [64, 1, 1, 64]
      • [1, 0.451101]
      • [1, 1, 1, 64]
      • [1, 0.00617776]
      • [64, 64, 1, 64]
      • [0, 27.1934]
      • [1, 64, 1, 64]
      • [3, 0.528463]
      • [64, 1, 1, 64]
      • [3, 0.551101]
      • [1, 1, 1, 64]
      • [3, 0.10617776]
      • [64, 64, 1, 64]
      • [2, 27.2934]
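
One plausible reading of these entries, stated here as an assumption that this thread does not confirm, is that they come in pairs: a four-element problem size followed by [solution index, measured performance]. Under that assumption, the sketch below pairs up the values quoted above and picks the best-performing solution index for an exact size match. The names LogicEntry, kEntries and best_solution are hypothetical and are not Tensile's actual code.

```cpp
#include <array>
#include <cassert>
#include <vector>

struct LogicEntry {
    std::array<int, 4> size;  // problem size, as listed in the YAML
    int solution;             // assumed: index into the file's solution list
    double perf;              // assumed: measured performance for that size
};

// The sixteen lines quoted above, read as eight (size, [index, perf]) pairs.
const std::vector<LogicEntry> kEntries = {
    {{1, 64, 1, 64}, 1, 0.428463},
    {{64, 1, 1, 64}, 1, 0.451101},
    {{1, 1, 1, 64}, 1, 0.00617776},
    {{64, 64, 1, 64}, 0, 27.1934},
    {{1, 64, 1, 64}, 3, 0.528463},
    {{64, 1, 1, 64}, 3, 0.551101},
    {{1, 1, 1, 64}, 3, 0.10617776},
    {{64, 64, 1, 64}, 2, 27.2934},
};

// Returns the solution index with the highest perf among exact size
// matches, or -1 when the size is not listed.
int best_solution(const std::vector<LogicEntry>& entries,
                  const std::array<int, 4>& size)
{
    int best = -1;
    double best_perf = -1.0;
    for (const LogicEntry& e : entries) {
        assert(e.perf >= 0.0);  // the sketch assumes non-negative perf values
        if (e.size == size && e.perf > best_perf) {
            best = e.solution;
            best_perf = e.perf;
        }
    }
    return best;
}
```

With the values quoted above, best_solution(kEntries, {64, 64, 1, 64}) returns 2, because 27.2934 is the larger of the two figures listed for that size.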
