iree-nvgpu's Introduction

OpenXLA NVIDIA GPU Compiler and Runtime

This project contains the compiler and runtime plugins that enable specialized targeting of NVIDIA GPUs for the OpenXLA platform. It builds on top of the core IREE toolkit.

Development setup

The project can be built either as part of IREE, by manually specifying the plugin path via -DIREE_COMPILER_PLUGIN_PATHS, or directly, for development focused specifically on NVIDIA GPUs:

cmake -GNinja -B build/ -S . \
    -DCMAKE_BUILD_TYPE=RelWithDebInfo \
    -DIREE_ENABLE_ASSERTIONS=ON \
    -DCMAKE_C_COMPILER=clang \
    -DCMAKE_CXX_COMPILER=clang++ \
    -DIREE_ENABLE_LLD=ON

# Recommended:
# -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache
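
For reference, here is a minimal sketch of the alternative in-tree build, where IREE itself is configured with this repository as a compiler plugin (paths below are illustrative; the exact plugin directory may differ):

cmake -GNinja -B ../iree/build/ -S ../iree \
    -DCMAKE_BUILD_TYPE=RelWithDebInfo \
    -DIREE_COMPILER_PLUGIN_PATHS=/path/to/openxla-nvgpu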

Note that you will need a checkout of the IREE codebase at ../iree relative to the directory where the openxla-nvgpu compiler was checked out. Running the sync_deps.py script should bring in all source dependencies at the needed versions (into the parent directory).
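
A typical invocation, from this repository's checkout (illustrative):

python sync_deps.py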

See the IREE getting started guide for additional configuration options and details of how to set this up.

Installing dependencies

You must have the CUDA Toolkit installed, together with cuDNN (see instructions).

See the project settings for options to build without the components that require the full dependencies.

On Linux, the directory containing libcudnn.so must be added to LD_LIBRARY_PATH; cuDNN is commonly installed into the CUDA toolkit tree:

export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda/lib64
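
As an optional sanity check (path illustrative), confirm that the library is actually present in that directory:

ls /usr/local/cuda/lib64/libcudnn*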

Running tests

Some tests can run only on Ampere or newer devices because they rely on the cuDNN runtime fusion engine.

Tests that depend on having a device present can be disabled with -DOPENXLA_NVGPU_INCLUDE_DEVICE_TESTS=OFF.
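
For example, the flag can be passed when (re)configuring the build directory created above, before running the test target:

cmake -B build/ -DOPENXLA_NVGPU_INCLUDE_DEVICE_TESTS=OFF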

cmake --build build --target openxla-nvgpu-run-tests

Project Maintenance

This section is a work in progress describing various project maintenance tasks.

Prerequisite: Install openxla-devtools

pip install git+https://github.com/openxla/openxla-devtools.git

Sync all deps to pinned versions

openxla-workspace sync

Update IREE to head

This updates the pinned IREE revision to the HEAD revision at the remote.

# Updates the sync_deps.py metadata.
openxla-workspace roll iree
# Brings all dependencies to pinned versions.
openxla-workspace sync

Full update of all deps

This updates the pinned revisions of all dependencies. This is presently done by updating openxla-pjrt-plugin to remote HEAD and deriving the IREE dependency from its pin.

# Updates the sync_deps.py metadata.
openxla-workspace roll nightly
# Brings all dependencies to pinned versions.
openxla-workspace sync

Pin current versions of all deps

This can be done after local, cross-project changes have been made and landed. It snapshots the state of all deps as actually checked out and updates the metadata.

openxla-workspace pin

iree-nvgpu's People

Contributors

bviyer, chsigg, ezhulenev, frgossen, ftynse, gmngeoffrey, iree-github-actions-bot, jpienaar, matthias-springer, mjsml, sherhut, stellaraccident

iree-nvgpu's Issues

[Epic] Production integration of cuBLAS, cuDNN, and Triton

MS2 epic-level item to track library integration work.

### P1 cuBLAS
- [ ] cuBLAS integration for GEMMs - needs owner
### P1 Triton
- [ ] Triton integration @ezhulenev 
- [ ] #54 
- [ ] #13848
### P2 cuDNN
- [ ] cuDNN for MHA (flash attention) integration
- [ ] https://github.com/openxla/openxla-nvgpu/issues/98

[cuDNN] Use destination-passing style @cudnn.execute API

Instead of allocating result buffers inside the custom cuDNN module, we should pass them in as arguments (destination-passing style).

Example:

// Allocate the result tensor inside the program.
%ret = flow.tensor.empty : tensor<4x4x4x4xf32>
// Export it to a raw HAL buffer that the custom module can write into.
%ret_buffer = hal.tensor.export %ret : tensor<4x4x4x4xf32> -> !hal.buffer
// Pass the destination buffer as an argument; the call returns no results.
call @cudnn.execute(..., %ret_buffer) : (...) -> ()
