IREE: An Experimental MLIR Execution Environment

DISCLAIMER: This is an early phase project that we hope will graduate into a supported form someday, but it is far from ready for everyday use and is made available without any support. With that said, feel free to browse the issues and reach out on the iree-discuss mailing list.

Table of Contents

  • Build Status
  • Quickstart
  • Talks
  • Project Goals
  • Milestones
  • Current Status
  • Dependencies
  • License

Build Status

CI System        Build System   Platform   Status
GitHub Actions   Bazel          Linux      Workflow History
Kokoro           Bazel          Linux      kokoro-status-bazel-linux
Kokoro           CMake          Linux      kokoro-status-cmake-linux

Quickstart

More coming soon! Performing a full model translation may require a few steps (such as ensuring you have a working TensorFlow build); however, we will provide pre-translated example models that allow independent testing of the runtime portions.

Talks

We occasionally hold recorded meetings or give talks and will post them here.

Project Goals

IREE (Intermediate Representation Execution Environment, pronounced "eerie") is an experimental compiler backend for MLIR that lowers ML models to an IR optimized for real-time mobile/edge inference against heterogeneous hardware accelerators.

The IR produced contains the sequencing information required to communicate pipelined data dependencies and parallelism to low-level hardware APIs like Vulkan, and embeds hardware/API-specific binaries such as SPIR-V or compiled ARM code. As the IR is specified against an abstract execution environment, there are many potential ways to run a compiled model, and one such way is included as an example and testbed for runtime optimization experiments.

The included layered runtime scales from generated code for a particular API (such as emitting C code calling external DSP kernels), to a HAL (Hardware Abstraction Layer) that allows the same generated code to target multiple APIs (like Vulkan and Direct3D 12), to a full VM allowing runtime model loading for flexible deployment options and heterogeneous execution. Consider both the compiler and the included runtime a toolbox for making it easier - via the versatility of MLIR - to take ML models from their source to some varying degree of integration with your application.
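
To make the layering concrete, here is a minimal C++ sketch of the HAL idea: dispatch logic written once against an abstract device interface, with API-specific backends slotted in behind it. All names below are hypothetical illustrations for this README, not IREE's actual API.

```cpp
// Hypothetical illustration of a HAL-style abstraction: the dispatch logic is
// written once and the API-specific details live behind an interface.
#include <cstdio>

// Abstract device interface (stand-in for a real HAL device).
class Device {
 public:
  virtual ~Device() = default;
  // Launches a compiled kernel over a 1-D grid of workgroups.
  virtual void Dispatch(const char* kernel_name, int workgroups) = 0;
};

// One possible backend; a Direct3D 12 or CPU backend would implement the
// same interface.
class VulkanDevice : public Device {
 public:
  void Dispatch(const char* kernel_name, int workgroups) override {
    // A real implementation would record vkCmdDispatch into a command buffer.
    std::printf("[vulkan] dispatch %s x%d\n", kernel_name, workgroups);
  }
};

// Generated code only sees the abstract interface, so the same code can
// target any backend without modification.
void RunModel(Device& device) {
  device.Dispatch("matmul", 16);
  device.Dispatch("bias_add", 4);
}

int main() {
  VulkanDevice device;
  RunModel(device);
  return 0;
}
```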

Demonstrate MLIR

IREE has been developed alongside MLIR and is used as an example of how non-traditional ML compiler backends and runtimes can be built: it focuses more on the math being performed and how that math is scheduled rather than graphs of "ops" and in some cases allows doing away with a runtime entirely. It seeks to show how more holistic approaches that exploit the MLIR framework and its various dialects can be both easy to understand and powerful in the optimizations to code size, runtime complexity, and performance they enable.

Demonstrate Advanced Models

By using models with much greater complexity than the usual references (such as MobileNet) we want to show how weird things can get when model authors are allowed to get creative: dynamic shapes, dynamic flow control, dynamic multi-model dispatch (including models that conditionally dispatch other models), streaming models, tree-based search algorithms, etc. We are trying to build IREE from the ground-up to enable these models and run them efficiently on modern hardware. Many of our example models are sequence-to-sequence language models from the Lingvo project representing cutting edge speech recognition and translation work.

Demonstrate ML-as-a-Game-Engine

An observation that has driven the development of IREE is one of ML workloads not being much different than traditional game rendering workloads: math is performed on buffers with varying levels of concurrency and ordering in a pipelined fashion against accelerators designed to make such operations fast. In fact, most ML is performed on the same hardware that was designed for games! Our approach is to use the compiler to transform ML workloads to ones that look eerily (pun intended) similar to what a game performs in per-frame render workloads, optimize for low-latency and predictable execution, and integrate well into existing systems both for batched and interactive usage. The IREE runtime is designed to feel more like game engine middleware than a standalone ML inference system, though we still have much work to do towards that goal. This should make it easy to use existing tools for high-performance/low-power optimization of GPU workloads, identify driver or system issues introducing latency, and help to improve the ecosystem overall.
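
As a rough sketch of what that looks like in practice, the hypothetical C++ below records a model's dispatches into a reusable command buffer once and then submits it per inference, the same record-once, submit-per-frame shape a renderer uses. None of these types are real IREE or Vulkan API; they only illustrate the structure.

```cpp
// Hypothetical sketch: per-inference submission structured like a game
// engine's per-frame render loop. Record once, submit many times.
#include <cstdio>
#include <string>
#include <vector>

struct DispatchCmd {
  std::string kernel;
  int workgroups;
};

// Stand-in for an API-level command buffer (e.g. a VkCommandBuffer).
struct CommandBuffer {
  std::vector<DispatchCmd> commands;
};

CommandBuffer RecordModel() {
  // Recorded once up front; the dependency ordering is baked in at record
  // time, so the per-inference cost is just a submission.
  return CommandBuffer{{{"conv2d", 64}, {"relu", 8}, {"matmul", 16}}};
}

void Submit(const CommandBuffer& cb, int frame) {
  for (const DispatchCmd& cmd : cb.commands) {
    std::printf("frame %d: dispatch %s x%d\n", frame, cmd.kernel.c_str(),
                cmd.workgroups);
  }
}

int main() {
  CommandBuffer cb = RecordModel();
  // Steady-state loop: one submission per inference, like one per frame.
  for (int frame = 0; frame < 3; ++frame) Submit(cb, frame);
  return 0;
}
```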

Demonstrate Standards-based ML via Vulkan and SPIR-V

With the above observation that ML can look like games from the systems perspective it follows that APIs and technologies good for games should probably also be good for ML. In many cases we've identified only a few key differences that exist and just as extensions have been added and API improvements have been made to graphics/compute standards for decades we hope to demonstrate and evaluate small, tactical changes that can have large impacts on ML performance through these standard APIs. We would love to allow hardware vendors to be able to make ML efficient on their hardware without the need for bespoke runtimes and special access such that any ML workload produced by any tool runs well. We'd consider the IREE experiment a success if what resulted was some worked examples that help advance the entire ecosystem!

Milestones

We are currently just at the starting line, with basic MNIST MLP running end-to-end on both a CPU interpreter and Vulkan. As we scale out the compiler we will be increasing the complexity of the models that can run and demonstrating more of the optimizations we've found useful in previous efforts to run them efficiently on edge devices.

A short-term Roadmap is available describing the major areas we are focusing on, in addition to the more infrastructure-focused work listed below.

We'll be setting up GitHub milestones with issues tracking major feature work we are planning. For now, our areas of work are:

  • Allocation tracking and aliasing in the compiler
  • Pipelined scheduler in the VM for issuing proper command buffers
  • New CPU interpreter that enables lightweight execution on ARM and x86
  • C code generator and API to demonstrate "runtimeless" mode
  • Quantization using the MLIR quantization framework
  • High-level integration and examples when working with TensorFlow 2.0
  • Switching from IREE's XLA-to-SPIR-V backend to the general MLIR SPIR-V backend

Things we are interested in but don't yet have in-progress:

  • Ahead-of-time compiled ARM NEON backend (perhaps via SPIRV-LLVM, SPIRV-to-ISPC, or some other technique)
  • HAL backends for Metal 2 and Direct3D 12
  • Profile-guided optimization support for scheduling feedback

Current Status

Documentation

Coming soon :)

Build System and CI

  • We support Bazel for builds of all parts of the project.
  • We also maintain a CMake build for a subset of runtime components designed to be used in other systems.

Code and Style

The project is still very early, and the code is a mix of styles written prior to many of the more recent ergonomics improvements in MLIR and its TableGen. Future changes will replace the legacy code style with prettier forms and simplify the project structure to make it easier to separate the different components. Some entire portions of the code (such as the CPU interpreter) will likely be dropped or rewritten. For now, assume churn!

The compiler portions of the code (almost exclusively under iree/compiler/) follow the LLVM style guide and have the same system requirements as MLIR itself. In general they require a more modern C++ compiler.

The runtime portions vary but most are designed to work with C++11 and use Abseil to bring in future C++14 and C++17 features.
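
For instance, a runtime function written today against Abseil swaps over to the standard library types almost mechanically once C++17 is available. This snippet uses real Abseil APIs, though the function itself is only an illustration:

```cpp
#include <string>

#include "absl/container/inlined_vector.h"
#include "absl/strings/string_view.h"
#include "absl/types/optional.h"
#include "absl/types/span.h"

// absl::string_view and absl::optional stand in for the C++17
// std::string_view and std::optional until those are available.
absl::optional<int> FindIndex(absl::Span<const std::string> names,
                              absl::string_view target) {
  for (int i = 0; i < static_cast<int>(names.size()); ++i) {
    if (names[i] == target) return i;
  }
  return absl::nullopt;
}

int main() {
  // absl::InlinedVector avoids heap allocation for small sizes.
  absl::InlinedVector<std::string, 4> names = {"input", "output"};
  absl::optional<int> index = FindIndex(names, "output");
  return index.has_value() ? 0 : 1;
}
```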

Hardware Support

We are mostly targeting Vulkan and Metal on recent mobile devices and as such have limited our usage of hardware features and vendor extensions to those we have broad access to there. This is mainly just to keep our focus tight and does not preclude usage of features outside the standard sets or for other hardware types (in fact, we have a lot of fun ideas for VK_NVX_device_generated_commands and Metal 2.1's Indirect Command Buffers!).

Dependencies

NOTE: During the initial open source release we are still cleaning things up. If there are weird dependencies or layering issues that make life difficult for your particular use case, please file an issue so we can make sure to fix it.

Compiler

The compiler has several layers that allow scaling the dependencies required based on the source and target formats. In all cases MLIR is required, and for models not originating from TensorFlow (or already in XLA HLO format) it is the only dependency. When targeting the IREE runtime VM and HAL, FlatBuffers is required for serialization. Converting from TensorFlow models requires a dependency on TensorFlow (though only those parts required for conversion).

Runtime VM

The VM providing dynamic model deployment and advanced scheduling behavior requires Abseil for its common types; however, contributions are welcome to make it possible to replace Abseil with other libraries via shims/forwarding. The core types used by the runtime (excluding command line flags and such in tools) are limited to types coming in future C++ versions (variant, optional, string_view, etc.), cheap types (absl::Span), or simple standard containers (absl::InlinedVector). FlatBuffers is used to load compiled modules.
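
A minimal sketch of what FlatBuffers-based module loading looks like follows. The flatbuffers::Verifier and generated-accessor conventions are standard FlatBuffers, but the ModuleDef schema, its generated header, and the file name are hypothetical stand-ins for whatever the compiler actually emits.

```cpp
#include <cstdint>
#include <fstream>
#include <iterator>
#include <vector>

#include "flatbuffers/flatbuffers.h"
#include "module_def_generated.h"  // hypothetical schema-generated header

std::vector<uint8_t> ReadFile(const char* path) {
  std::ifstream file(path, std::ios::binary);
  return std::vector<uint8_t>(std::istreambuf_iterator<char>(file), {});
}

int main() {
  std::vector<uint8_t> buffer = ReadFile("model.module");
  // Verify the buffer before trusting any offsets inside it.
  flatbuffers::Verifier verifier(buffer.data(), buffer.size());
  if (!VerifyModuleDefBuffer(verifier)) return 1;
  // Access the root table in place; FlatBuffers needs no unpacking step,
  // which keeps module load time low.
  const ModuleDef* module = GetModuleDef(buffer.data());
  (void)module;  // hand off to the VM for execution
  return 0;
}
```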

Runtime HAL

The HAL and the provided implementations (Vulkan, etc) also use Abseil. Contributions are welcome to allow other types to be swapped in. A C99 HAL API is planned for code generation targets that will use no dependencies.

Testing and Tooling

SwiftShader is used to provide fast, hardware-independent testing of the Vulkan and SPIR-V portions of the toolchain.

License

IREE is licensed under the terms of the Apache 2.0 License. See LICENSE for more information.
