
rocWMMA

Welcome! rocWMMA is a C++ library for accelerating mixed-precision matrix multiply-accumulate (MMA) operations on AMD GPU hardware. rocWMMA makes it easier to break down MMA problems into fragments and to distribute block-wise MMA operations in parallel across GPU wavefronts. The API is a header-only library that compiles MMA acceleration directly into GPU kernel device code. This lets the compiler optimize the generated kernel assembly, and avoids the overhead of linking against external runtime libraries or launching separate kernels.

rocWMMA includes sample projects to validate and demonstrate API usage. These include simple GEMMs, performant GEMMs, DLRM, GEMV and hipRTC integration.

The test suite includes validation and benchmarking projects that focus on unit testing, GEMMs and DLRM.
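To give a flavor of the header-only API, here is a minimal sketch of one wavefront computing a single 16x16x16 fp16 MMA tile with the fragment API. The fragment layouts and block sizes chosen here are illustrative; see the samples and API reference for complete, performant GEMM kernels.

```cpp
#include <rocwmma/rocwmma.hpp>

using rocwmma::float16_t;
using rocwmma::float32_t;

// One wavefront cooperatively owns one 16x16 output tile:
// D = A (16x16, fp16) * B (16x16, fp16) accumulated in fp32.
__global__ void mma_tile_16x16x16(float16_t const* a, float16_t const* b,
                                  float32_t* d, uint32_t lda, uint32_t ldb,
                                  uint32_t ldd)
{
    rocwmma::fragment<rocwmma::matrix_a, 16, 16, 16, float16_t, rocwmma::row_major> fragA;
    rocwmma::fragment<rocwmma::matrix_b, 16, 16, 16, float16_t, rocwmma::col_major> fragB;
    rocwmma::fragment<rocwmma::accumulator, 16, 16, 16, float32_t> fragAcc;

    rocwmma::fill_fragment(fragAcc, 0.0f);        // zero the accumulator
    rocwmma::load_matrix_sync(fragA, a, lda);     // cooperative tile loads
    rocwmma::load_matrix_sync(fragB, b, ldb);
    rocwmma::mma_sync(fragAcc, fragA, fragB, fragAcc);
    rocwmma::store_matrix_sync(d, fragAcc, ldd, rocwmma::mem_row_major);
}
```

In a real GEMM, the K dimension is tiled: the load and mma_sync steps run in a loop over K blocks before the final store.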

Requirements

rocWMMA currently supports the following AMDGPU architectures:

  • CDNA-class GPUs featuring matrix core support: gfx908, gfx90a, gfx940, gfx941, gfx942 (collectively 'gfx9')
  • RDNA3-class GPUs featuring AI acceleration support: gfx1100, gfx1101, gfx1102 (collectively 'gfx11')

Dependencies:

  • Minimum ROCm version support is 6.0.
  • Minimum cmake version support is 3.14.
  • Minimum ROCm-cmake version support is 0.8.0.
  • Minimum rocBLAS version support is rocBLAS 4.0.0 for ROCm 6.0* (or ROCm packages rocblas and rocblas-dev).
  • Minimum HIP runtime version support is 4.3.0 (or ROCm package hip-runtime-amd).
  • Minimum LLVM OpenMP runtime dev package version support is 10.0 (available as ROCm package rocm-llvm-dev).
    * = if using rocBLAS for validation.

    It is best to use available ROCm packages from the same release where applicable.

Build with CMake

For more detailed information, please refer to the rocWMMA installation guide.

Project options

  • AMDGPU_TARGETS: Build code for specific GPU target(s). Default: gfx908:xnack-;gfx90a:xnack-;gfx90a:xnack+;gfx1100;gfx1101;gfx1102
  • ROCWMMA_BUILD_TESTS: Build tests. Default: ON
  • ROCWMMA_BUILD_SAMPLES: Build samples. Default: ON
  • ROCWMMA_BUILD_DOCS: Build Doxygen documentation from code. Default: OFF
  • ROCWMMA_BUILD_ASSEMBLY: Generate assembly files. Default: OFF
  • ROCWMMA_BUILD_VALIDATION_TESTS: Build validation tests. Default: ON (requires ROCWMMA_BUILD_TESTS=ON)
  • ROCWMMA_BUILD_BENCHMARK_TESTS: Build benchmark tests. Default: OFF (requires ROCWMMA_BUILD_TESTS=ON)
  • ROCWMMA_BUILD_EXTENDED_TESTS: Build extended testing coverage. Default: OFF (requires ROCWMMA_BUILD_TESTS=ON)
  • ROCWMMA_VALIDATE_WITH_ROCBLAS: Use rocBLAS for validation tests. Default: ON (requires ROCWMMA_BUILD_VALIDATION_TESTS=ON)
  • ROCWMMA_BENCHMARK_WITH_ROCBLAS: Include rocBLAS benchmarking data. Default: OFF (requires ROCWMMA_BUILD_BENCHMARK_TESTS=ON)
  • ROCWMMA_USE_SYSTEM_GOOGLETEST: Use the system Google Test library instead of downloading and building it. Default: OFF (requires ROCWMMA_BUILD_TESTS=ON)

Example configurations

By default, the project is configured in release mode and is linked against rocBLAS for validating results. Here are some configuration examples:

  • Basic: CC=/opt/rocm/bin/amdclang CXX=/opt/rocm/bin/amdclang++ cmake -B<build_dir> .
  • Targeting gfx908: CC=/opt/rocm/bin/amdclang CXX=/opt/rocm/bin/amdclang++ cmake -B<build_dir> . -DAMDGPU_TARGETS=gfx908:xnack-
  • Debug build: CC=/opt/rocm/bin/amdclang CXX=/opt/rocm/bin/amdclang++ cmake -B<build_dir> . -DCMAKE_BUILD_TYPE=Debug
  • Build without rocBLAS (on by default): CC=/opt/rocm/bin/amdclang CXX=/opt/rocm/bin/amdclang++ cmake -B<build_dir> . -DROCWMMA_VALIDATE_WITH_ROCBLAS=OFF -DROCWMMA_BENCHMARK_WITH_ROCBLAS=OFF

After configuration, build with cmake --build <build_dir> -- -j<nproc>

Documentation

For more comprehensive documentation covering installation, sample and test contents, the API reference, and the programmer's guide, you can build the documentation locally in several ways.

HTML

cd docs

pip3 install -r sphinx/requirements.txt

python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html

The HTML documentation can be viewed in your browser by opening docs/_build/html/index.html.

PDF

cd docs

sudo apt-get update
sudo apt-get install doxygen
sudo apt-get install texlive-latex-base texlive-latex-extra

pip3 install -r sphinx/requirements.txt

python3 -m sphinx -T -E -b latex -d _build/doctrees -D language=en . _build/latex

cd _build/latex

pdflatex rocwmma.tex

Running the above commands generates rocwmma.pdf.

The latest official documentation for rocWMMA is available at: https://rocm.docs.amd.com/projects/rocWMMA/en/latest/index.html.

Contributing to the rocWMMA Library

Community collaboration is encouraged! If you are considering contributing, please follow the rocWMMA Contribution Guide to get started.


rocwmma's Issues

INT4 support

RDNA3 matrix cores support WMMA instructions with int4 input type, but this seems to be unsupported in rocWMMA. Are there plans to add int4 support to rocWMMA?

CMake error with FindOpenMP

I built rocWMMA from source at tag rocm-5.7.0:
cmake -S . -B build -G Ninja -DCMAKE_CXX_COMPILER=hipcc -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$ROCM_PATH -DCMAKE_PREFIX_PATH=$ROCM_PATH -DAMDGPU_TARGETS=gfx1102 -DROCWMMA_BUILD_TESTS=OFF -DROCWMMA_BUILD_SAMPLES=OFF
Error reported as follows:

CMake Error at /usr/share/cmake-3.22/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
  Could NOT find OpenMP_CXX (missing: OpenMP_CXX_FLAGS OpenMP_CXX_LIB_NAMES)
Call Stack (most recent call first):
  /usr/share/cmake-3.22/Modules/FindPackageHandleStandardArgs.cmake:594 (_FPHSA_FAILURE_MESSAGE)
  /usr/share/cmake-3.22/Modules/FindOpenMP.cmake:544 (find_package_handle_standard_args)
  CMakeLists.txt:98 (find_package)

Here's my question:

  1. Is the OpenMP dependency in rocWMMA for CPUs or GPUs? What is the purpose of using OpenMP?
  2. Do I need to compile OpenMP support for AMD GPUs, and if so, how?
  3. I compiled llvm-project and hipcc from source, and the result has no OpenMP-related header files or libraries. What can I do to solve this problem?

Looking forward to your reply.

[Feature]: Compile for Windows library

Suggestion Description

I tried to build rocWMMA for Windows to use in llama.cpp, but I cannot find any documentation.
ROCm 5.7.1 does not seem to have rocWMMA support.

Operating System

Windows 11

GPU

7900XTX

ROCm Component

No response

[Issue]: rocWMMA 6.0.2 build with tests fails with "no member named 'cout' in namespace 'std'"

Problem Description

From build log:

FAILED: test/gemm/gemm_PGR1_LB2_MP0_MB_CP/test/wave/CMakeFiles/gemm_PGR1_LB2_MP0_MB_CP_WV-validate.dir/16x16_tt_1x1.cpp.o 
/usr/bin/cmake -E time /usr/bin/hipcc -DROCWMMA_VALIDATE_WITH_ROCBLAS -DROCWMMA_VALIDATION_TESTS -DUSE_PROF_API=1 -D__HIP_PLATFORM_AMD__=1 -I/var/tmp/portage/sci-libs/rocWMMA-6.0.2/work/rocWMMA-rocm-6.0.2/test/gemm/gemm_PGR1_LB2_MP0_MB_CP/test/wave -I/var/tmp/portage/sci-libs/rocWMMA-6.0.2/work/rocWMMA-rocm-6.0.2/test -I/var/tmp/portage/sci-libs/rocWMMA-6.0.2/work/rocWMMA-rocm-6.0.2/test/gemm/gemm_PGR1_LB2_MP0_MB_CP -I/var/tmp/portage/sci-libs/rocWMMA-6.0.2/work/rocWMMA-rocm-6.0.2/test/gemm -I/var/tmp/portage/sci-libs/rocWMMA-6.0.2/work/rocWMMA-rocm-6.0.2/library/include -isystem /usr/include/rocblas  -march=native -O2 -pipe -fdiagnostics-color=always -flto=thin -Werror=odr -Werror=strict-aliasing -std=c++17 -mcmodel=large -x hip --offload-arch=gfx908:xnack- -fopenmp=libomp -MD -MT test/gemm/gemm_PGR1_LB2_MP0_MB_CP/test/wave/CMakeFiles/gemm_PGR1_LB2_MP0_MB_CP_WV-validate.dir/16x16_tt_1x1.cpp.o -MF test/gemm/gemm_PGR1_LB2_MP0_MB_CP/test/wave/CMakeFiles/gemm_PGR1_LB2_MP0_MB_CP_WV-validate.dir/16x16_tt_1x1.cpp.o.d -o test/gemm/gemm_PGR1_LB2_MP0_MB_CP/test/wave/CMakeFiles/gemm_PGR1_LB2_MP0_MB_CP_WV-validate.dir/16x16_tt_1x1.cpp.o -c /var/tmp/portage/sci-libs/rocWMMA-6.0.2/work/rocWMMA-rocm-6.0.2/test/gemm/gemm_PGR1_LB2_MP0_MB_CP/test/wave/16x16_tt_1x1.cpp
In file included from /var/tmp/portage/sci-libs/rocWMMA-6.0.2/work/rocWMMA-rocm-6.0.2/test/gemm/gemm_PGR1_LB2_MP0_MB_CP/test/wave/16x16_tt_1x1.cpp:27:
In file included from /var/tmp/portage/sci-libs/rocWMMA-6.0.2/work/rocWMMA-rocm-6.0.2/test/gemm/gemm_PGR1_LB2_MP0_MB_CP/test/test_includes.hpp:31:
In file included from /var/tmp/portage/sci-libs/rocWMMA-6.0.2/work/rocWMMA-rocm-6.0.2/test/gemm/gemm_PGR1_LB2_MP0_MB_CP/detail/kernel_generator_impl.hpp:33:
In file included from /var/tmp/portage/sci-libs/rocWMMA-6.0.2/work/rocWMMA-rocm-6.0.2/test/gemm/gemm_PGR1_LB2_MP0_MB_CP/detail/kernel_impl.hpp:30:
In file included from /var/tmp/portage/sci-libs/rocWMMA-6.0.2/work/rocWMMA-rocm-6.0.2/test/gemm/gemm_PGR1_LB2_MP0_MB_CP/device/kernel_device_func.hpp:36:
In file included from /var/tmp/portage/sci-libs/rocWMMA-6.0.2/work/rocWMMA-rocm-6.0.2/test/gemm/gemm_PGR1_LB2_MP0_MB_CP/device/kernel_predicates.hpp:30:
/var/tmp/portage/sci-libs/rocWMMA-6.0.2/work/rocWMMA-rocm-6.0.2/test/gemm/gemm_predicates_base.hpp:92:18: error: no member named 'cout' in namespace 'std'
   92 |             std::cout << "Global Predicates:\n";
      |             ~~~~~^
/var/tmp/portage/sci-libs/rocWMMA-6.0.2/work/rocWMMA-rocm-6.0.2/test/gemm/gemm_predicates_base.hpp:93:18: error: no member named 'cout' in namespace 'std'
   93 |             std::cout << "TBlockXTest: " << (bool)GlobalPredicates::TBlockXTest << std::endl;
      |             ~~~~~^
/var/tmp/portage/sci-libs/rocWMMA-6.0.2/work/rocWMMA-rocm-6.0.2/test/gemm/gemm_predicates_base.hpp:94:18: error: no member named 'cout' in namespace 'std'
   94 |             std::cout << "MinTBlockTest: " << (bool)GlobalPredicates::MinTBlockTest << std::endl;
      |             ~~~~~^
...

Please add #include <iostream> to https://github.com/ROCm/rocWMMA/blob/develop/test/gemm/gemm_predicates_base.hpp (or remove prints).

Operating System

Gentoo

CPU

AMD Ryzen 7 8700G

GPU

AMD Instinct MI300X

ROCm Version

ROCm 6.0.0

ROCm Component

rocWMMA

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

Provide an option to build with system googletest

Suggestion Description

Hi, could you provide an option to build tests using preinstalled system googletest?
This will make building and testing easier for maintainers.
I will provide a PR similar to one that I used in Gentoo.

Operating System

No response

GPU

No response

ROCm Component

No response

Any conversion functions for rocWMMA type

Hi,
Since rocWMMA provides its own data types like rocwmma::bfloat16, I wondered if there are any functions that convert a float to your rocwmma half or bfloat16 types, like __float2bfloat16 in NVIDIA CUDA? Thanks!

[Issue]: MI250X SGEMM performance

Problem Description

Hi,

I am testing with sample/perf_sgemm.cpp on MI250X with ROCM/5.7.1.
When the problem size is (4096, 4096, 128), the achieved throughput is only ~13 TFLOPS, which is below 30% of the peak.
Is this performance expected for this problem?

Operating System

SUSE

CPU

AMD EPYC 7A53

GPU

AMD Instinct MI250X

ROCm Version

ROCm 5.7.1

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

verification fails for the simple MM example

The HIP example at https://github.com/zjin-lcf/HeCBench/tree/master/src/wmma-hip is similar to the code in https://rocm.docs.amd.com/projects/rocWMMA/en/latest/API_Reference_Guide.html:

The matrices A and B are of type float16, while the matrices C and D are float32. However, verification fails for the example after the following host GEMM is added for validation.

// Host GEMM validation
void gemm_cpu_h(uint32_t m, uint32_t n, uint32_t k, rocwmma::float16_t const *a,
                rocwmma::float16_t const *b, rocwmma::float32_t const *c, rocwmma::float32_t *d,
                uint32_t lda, uint32_t ldb, uint32_t ldc, uint32_t ldd,
                rocwmma::float32_t alpha, rocwmma::float32_t beta) {
  for (uint32_t i = 0; i < m; ++i) {
    for (uint32_t j = 0; j < n; ++j) {
      rocwmma::float32_t accum = 0.0f;
      for (uint32_t h = 0; h < k; ++h) {
        accum += static_cast<rocwmma::float32_t>(a[i * lda + h]) *
                 static_cast<rocwmma::float32_t>(b[j * ldb + h]);
      }
      d[i * ldd + j] = alpha * accum + beta * c[i * ldc + j];
    }
  }
}

It is successful for the CUDA example (https://github.com/zjin-lcf/HeCBench/tree/master/src/wmma-cuda).

I am not sure if there are issues in the example or library. Thank you.

RDNA3 WMMA‘s peak performance

I apologize for not knowing where to ask the question.
I found architecture details on this link: tomshardware.com.
The slides show that the WMMA instruction achieves only 2x fp16/bf16 throughput compared to vector fp32 fused multiply-add.
However, the equivalent instruction on MI100 achieves 8x.

I would like to know the peak performance for matrix multiplication for RDNA3. Thank you.

MI100 performance

What is the expected performance of MI100? I was expecting a much higher number, since its theoretical peak is more than 180 TFLOPS. I got higher numbers when testing a 7900 XTX, even though it has a lower theoretical peak!

./perf_sgemm
Initializing host data...
Initializing device data...
Launching GEMM kernel...
gridDim (56 56) blockdim (128 2)
TBlockX, TBlockY, BlocksX, BlocksY, BlkM, BlkN, BlkK, MatM, MatN, MatK, alpha, lda, ldb, beta, ldc, ldd, elapsedMs, Problem Size(GFlops), TFlops/s
128, 2, 2, 2, 32, 32, 16, 7168, 7168, 7168, 2, 7168, 7168, 2, 7168, 7168, 165.95, 736.587, 22.193
Finished!
./perf_hgemm
Initializing host data...
Initializing device data...
Launching GEMM kernel...
gridDim (56 56) blockdim (128 2)
TBlockX, TBlockY, BlocksX, BlocksY, BlkM, BlkN, BlkK, MatM, MatN, MatK, alpha, lda, ldb, beta, ldc, ldd, elapsedMs, Problem Size(GFlops), TFlops/s
128, 2, 2, 2, 32, 32, 16, 7168, 7168, 7168, 2, 7168, 7168, 2, 7168, 7168, 78.1733, 736.587, 47.1124
Finished!

[Issue]: std::enable_if compiler error.

Problem Description

What could be the reason for the compiler error below when calling orortcCompileProgram? I see that OrochiUtils::getFunction uses the "-std=c++17" option to compile the kernel.

In file included from C:\Users\Piotr\AppData\Local\Temp\comgr-2cfecd\input\third_party/MiniNN/MiniNN/Gpu/Kernels/NetworkGpuKernels.h:26:
In file included from third_party/MiniNN/contrib/rocWMMA/library/include\rocwmma/rocwmma.hpp:29:
In file included from third_party/MiniNN/contrib/rocWMMA/library/include\rocwmma/internal/io_config.hpp:29:
In file included from third_party/MiniNN/contrib/rocWMMA/library/include\rocwmma/internal/broadcast.hpp:29:
In file included from third_party/MiniNN/contrib/rocWMMA/library/include\rocwmma/internal/types.hpp:131:
third_party/MiniNN/contrib/rocWMMA/library/include\rocwmma/internal/vector.hpp:145:28: error: no template named 'enable_if' in namespace 'std'; did you mean '__hip_internal::enable_if'?
typename std::enable_if<(std::is_same<U, T>{}) && (Rank > 1)>::type* = nullptr>
^~~~~~~~~~~~~~
__hip_internal::enable_if

Operating System

Windows 10.0.19045

CPU

AMD Ryzen 7 5800X 8-Core Processor

GPU

AMD Radeon RX 7900 XTX

ROCm Version

ROCm 6.0.0

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

Advantages of Using rocWMMA over Compiler Intrinsics for CUDA to ROCm Transition

Hello,

I'm currently in the process of transitioning from CUDA to ROCm. During this transition, I've come to understand that rocWMMA can serve as a mapping library for the "Warp matrix functions nvcuda::wmma" that I used to work with in CUDA. Nevertheless, I've also noticed that we have "compiler intrinsics", which provide functionality quite similar to rocWMMA, see this link.

Specifically, I've observed that we can substitute the load operations in rocWMMA with specific built-in intrinsics. This realization has led me to question why I might choose to use rocWMMA over these compiler intrinsics.

What advantages does rocWMMA provide over these compiler intrinsics? Is it more efficient in certain situations or does it offer any unique functionalities? Any insight on this would be highly appreciated.

Thank you for your assistance.
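For context on what "compiler intrinsics" means here, the gfx11 WMMA builtins can be called directly from device code. The sketch below is hedged: the builtin name and ext_vector_type signature follow LLVM's AMDGPU builtins for wave32 RDNA3, but you should verify the exact signature against your compiler before relying on it.

```cpp
// Hedged sketch: driving a gfx11 WMMA instruction via a compiler builtin
// (wave32). rocWMMA wraps such intrinsics behind its portable fragment API.
typedef _Float16 half16 __attribute__((ext_vector_type(16)));
typedef float float8 __attribute__((ext_vector_type(8)));

__device__ float8 wmma_16x16x16(half16 a, half16 b, float8 acc)
{
    // One 16x16x16 tile: D = A * B + C, spread across the wavefront.
    // Register layout for a, b, and acc is fixed by the hardware, and
    // managing it by hand is exactly what rocWMMA's fragments abstract away.
    return __builtin_amdgcn_wmma_f32_16x16x16_f16_w32(a, b, acc);
}
```

The practical trade-off: intrinsics expose the raw instruction but leave data layout, cooperative loads/stores, and per-architecture differences (gfx9 MFMA vs. gfx11 WMMA) to you, whereas rocWMMA handles those portably.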

Using rocwmma with pytorch

I want to convert CUDA code containing WMMA into HIP. I have unit tests done and they pass. I hope to integrate this code into PyTorch. When I executed "python setup.py install", I found that all architectures were added when the code was compiled, so execution reported an error, because rocWMMA does not support gfx1030. What should I do to avoid this error? Can I compile for just one architecture?

This is the content of the setup.py file:

from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension
from pathlib import Path
import os

workspace_dir = Path(os.path.dirname(os.path.abspath(__file__)))

setup(
    name="fused_attn",
    ext_modules=[
        CUDAExtension(
            name="fused_attn",
            sources=[str(workspace_dir / "src" / "fused_attn_extention.cu")],
            include_dirs=[str(workspace_dir / "include")],
            extra_compile_args=[
                "-O3", 
                "-std=c++20", 
                "--offload-arch=gfx90a",
                "-I/opt/rocm/include",
                "-I/opt/rocm/hip/include"
                ],
        )
    ],
    cmdclass={
        "build_ext": BuildExtension
    }
)

The following is part of the error report (I specified the architecture, but PyTorch still adds all architectures):

[1/1] /opt/rocm/bin/hipcc  -I/data/zhaorong/code/fused-attention/include -I/opt/conda/lib/python3.8/site-packages/torch/include -I/opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/rocm/miopen/include -I/opt/rocm/hip/include -I/opt/conda/include/python3.8 -c -c /data/zhaorong/code/fused-attention/src/fused_attn_extention.hip -o /data/zhaorong/code/fused-attention/build/temp.linux-x86_64-cpython-38/data/zhaorong/code/fused-attention/src/fused_attn_extention.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 --offload-arch=gfx90a -I/opt/rocm/include -I/opt/rocm/hip/include -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=fused_attn -D_GLIBCXX_USE_CXX11_ABI=1 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030 -fno-gpu-rdc
FAILED: /data/zhaorong/code/fused-attention/build/temp.linux-x86_64-cpython-38/data/zhaorong/code/fused-attention/src/fused_attn_extention.o 
/opt/rocm/bin/hipcc  -I/data/zhaorong/code/fused-attention/include -I/opt/conda/lib/python3.8/site-packages/torch/include -I/opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/rocm/miopen/include -I/opt/rocm/hip/include -I/opt/conda/include/python3.8 -c -c /data/zhaorong/code/fused-attention/src/fused_attn_extention.hip -o /data/zhaorong/code/fused-attention/build/temp.linux-x86_64-cpython-38/data/zhaorong/code/fused-attention/src/fused_attn_extention.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 --offload-arch=gfx90a -I/opt/rocm/include -I/opt/rocm/hip/include -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=fused_attn -D_GLIBCXX_USE_CXX11_ABI=1 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030 -fno-gpu-rdc
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.


In file included from /data/zhaorong/code/fused-attention/src/fused_attn_extention.hip:4:
In file included from /data/zhaorong/code/fused-attention/include/fused_attn_hip.cuh:6:
In file included from /opt/rocm-5.4.0/include/rocwmma/rocwmma.hpp:31:
In file included from /opt/rocm-5.4.0/include/rocwmma/internal/io_config.hpp:29:
In file included from /opt/rocm-5.4.0/include/rocwmma/internal/broadcast.hpp:29:
In file included from /opt/rocm-5.4.0/include/rocwmma/internal/types.hpp:339:
/opt/rocm-5.4.0/include/rocwmma/internal/types_ext.hpp:328:40: error: no matching conversion for static_cast from 'const rocwmma::hfloat16_t' (aka 'const __half') to 'rocwmma::float16_t' (aka '_Float16')
        return static_cast<hfloat16_t>(static_cast<float16_t>(x) * static_cast<float16_t>(y));
                                       ^~~~~~~~~~~~~~~~~~~~~~~~~
/opt/rocm/hip/include/hip/amd_detail/../../../../include/hip/amd_detail/amd_hip_fp16.h:233:13: note: candidate function
            operator __half_raw() const { return __half_raw{data}; }
            ^
/opt/rocm/hip/include/hip/amd_detail/../../../../include/hip/amd_detail/amd_hip_fp16.h:235:13: note: candidate function
            operator __half_raw() const volatile

Environment:
rocm: 5.4
ubuntu: 20.04
python: 3.8
pytorch: 1.12.1
GPU: MI210
rocwmma-dev: 0.7.0.50400-72~20.04
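Not from the thread itself, but one commonly used approach: torch.utils.cpp_extension reads the PYTORCH_ROCM_ARCH environment variable on ROCm builds to decide which offload architectures to target. Treat the variable name and behavior as an assumption to verify against your PyTorch version.

```shell
# Assumption: PyTorch's cpp_extension honors PYTORCH_ROCM_ARCH on ROCm.
# Restrict the extension build to gfx90a only, then build as usual.
export PYTORCH_ROCM_ARCH=gfx90a
python setup.py install
```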

Matrix Core processes matrices smaller than 16x16

Hello, rocwmma can compute half-precision 16x16x16 matrix multiplication. To calculate a 15x15x15 problem, you have to pad it to 16x16x16 in LDS. This is significantly slower than reading directly from global memory. Is there a better way?

Clarification on Using fp16/bf16/half as Compute Type

Hello,

While reviewing your documentation, I came across the mention that when executing wmma, the compute type can be set to fp16/bf16/half, which are described as "native f32 accumulations downcasted to fp16/bf16/half". I'm curious to understand whether utilizing these lower precision types in compute (which is not enabled in CUDA) can lead to performance improvements. Additionally, could you guide me on how to set this compute type if it can help performance?

Thank you!

Errors when trying to build rocWMMA on ROCm 5.3.0

In a previous post, you mentioned that rocWMMA is enabled from ROCm 5.2.0 onward, so I tried to build it following your documentation. However, it gives me these errors:

$ CC=hipcc CXX=hipcc cmake -B./build . -DAMDGPU_TARGETS=gfx908:xnack- -DROCWMMA_VALIDATE_WITH_ROCBLAS=OFF -DROCWMMA_BENCHMARK_WITH_ROCBLAS=OFF
-- The CXX compiler identification is Clang 15.0.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - failed
-- Check for working CXX compiler: /opt/rocm-5.3.0/bin/hipcc
-- Check for working CXX compiler: /opt/rocm-5.3.0/bin/hipcc - broken
CMake Error at /usr/share/cmake/Modules/CMakeTestCXXCompiler.cmake:59 (message):
  The C++ compiler

    "/opt/rocm-5.3.0/bin/hipcc"

  is not able to compile a simple test program.

  It fails with the following output:

    Change Dir: rocWMMA/build/CMakeFiles/CMakeTmp

    Run Build Command(s):/usr/bin/gmake -f Makefile cmTC_28327/fast && /usr/bin/gmake  -f CMakeFiles/cmTC_28327.dir/build.make CMakeFiles/cmTC_28327.dir/build
    gmake[1]: Entering directory 'rocWMMA/build/CMakeFiles/CMakeTmp'
    Building CXX object CMakeFiles/cmTC_28327.dir/testCXXCompiler.cxx.o
    /opt/rocm-5.3.0/bin/hipcc   --gcc-toolchain=/share/apps/gcc/8.4.0  -MD -MT CMakeFiles/cmTC_28327.dir/testCXXCompiler.cxx.o -MF CMakeFiles/cmTC_28327.dir/testCXXCompiler.cxx.o.d -o CMakeFiles/cmTC_28327.dir/testCXXCompiler.cxx.o -c rocWMMA/build/CMakeFiles/CMakeTmp/testCXXCompiler.cxx
    In file included from <built-in>:1:
    /opt/rocm-5.3.0/llvm/lib/clang/15.0.0/include/__clang_hip_runtime_wrapper.h:50:10: fatal error: 'cmath' file not found
    #include <cmath>
             ^~~~~~~
    1 error generated when compiling for gfx908.
    gmake[1]: *** [CMakeFiles/cmTC_28327.dir/build.make:79: CMakeFiles/cmTC_28327.dir/testCXXCompiler.cxx.o] Error 1
    gmake[1]: Leaving directory 'rocWMMA/build/CMakeFiles/CMakeTmp'
    gmake: *** [Makefile:127: cmTC_28327/fast] Error 2





  CMake will not be able to correctly generate this project.
Call Stack (most recent call first):
  CMakeLists.txt:39 (project)


-- Configuring incomplete, errors occurred!
See also "rocWMMA/build/CMakeFiles/CMakeOutput.log".
See also "rocWMMA/build/CMakeFiles/CMakeError.log".

Do you have any ideas on why this happens? Thanks!
