
A lightweight library for portable low-level GPU computation using WebGPU.

Home Page: https://gpucpp.answer.ai

License: Apache License 2.0


gpu.cpp

gpu.cpp is a lightweight library that makes portable GPU compute with C++ simple.

It focuses on general purpose native GPU computation, leveraging the WebGPU specification as a portable low-level GPU interface. This means we can drop in GPU code in C++ projects and have it run on Nvidia, Intel, AMD, and other GPUs. The same C++ code can work on a wide variety of laptops, workstations, mobile devices or virtually any hardware with Vulkan, Metal, or DirectX support.

Technical Objectives: Lightweight, Fast Iteration, and Low Boilerplate

With gpu.cpp we want to provide a high-leverage library that lets individual developers and researchers incorporate GPU computation into their programs, relying on nothing more than a standard C++ compiler as tooling. Our goals are:

  • High power-to-weight ratio API: Provide the smallest API surface area that can cover the full range of GPU compute needs.
  • Fast compile/run cycles: Ensure projects can build nearly instantaneously; compile/run cycles should take under 5 seconds on a modern laptop.
  • Minimal dependencies and tooling overhead: A standard clang C++ compiler should be enough, with no external library dependencies beyond the WebGPU native implementation.

The implementation aims for a small API surface area with minimal boilerplate. A small number of library operations carries out a broad range of low-level GPU operations. We avoid abstractions that add layers of indirection, keeping the mapping between the gpu.cpp library and the raw WebGPU API clear when it's needed.

In this spirit of fast experimentation, we also want near-instantaneous C++ builds, taking no more than a second or two even on modestly capable personal computing devices. With this in mind, we keep both the API surface area and the implementation small, and we provide a prebuilt binary of the Dawn native WebGPU implementation.

The core library implementation in the header-only gpu.h source file is around 1000 lines of code. In addition to enabling instantaneous, semi-interactive compilation cycles, the small implementation surface area keeps the maintenance burden low and the velocity of improvements high. We also pre-build Google's Dawn WebGPU implementation as a shared library binary. This allows each build to link against the shared library, incorporating Google's powerful native WebGPU implementation without paying the cost of re-compiling Dawn during development cycles.

For more advanced users and release deployments, we include cmake examples for building Dawn together with gpu.cpp end-to-end, but this is neither required nor recommended for most users getting started.

Quick Start: Building and Running

To build a gpu.cpp project, you will need to have installed on your system:

  • clang++ compiler with support for C++17.
  • python3, to run the script that downloads the Dawn shared library.
  • make to build the project.
  • Only on Linux systems - Vulkan drivers. If Vulkan is not installed, you can run sudo apt install libvulkan1 mesa-vulkan-drivers vulkan-tools to install them.

The only library dependency of gpu.cpp is a WebGPU implementation. We currently support the Dawn native backend; support for other targets and WebGPU implementations (web browsers or other native implementations such as wgpu) is planned. Supported platforms are currently macOS, Linux, and Windows (via WSL).

Optionally, Dawn can be built from scratch with gpu.cpp using the cmake build scripts provided - see the -cmake targets in the Makefile. However, this is recommended for advanced users only. Building Dawn dependencies with cmake takes much longer than using the precompiled Dawn shared library.

After cloning the repo, from the top-level gpu.cpp directory, you should be able to build and run the hello world GELU example by typing:

make

The first time you build and run the project this way, it will automatically download a prebuilt shared library for the Dawn native WebGPU implementation (using the setup.py script). This places the Dawn shared library in the third_party/lib directory. Afterwards you should see libdawn.dylib on macOS or libdawn.so on Linux. This download only occurs once.

The build process itself should take a few seconds. If the build and execution are successful, you should see the output of the GELU computation:

Hello gpu.cpp!
--------------

  gelu(0.00) = 0.00
  gelu(0.10) = 0.05
  gelu(0.20) = 0.12
  gelu(0.30) = 0.19
  gelu(0.40) = 0.26
  gelu(0.50) = 0.35
  gelu(0.60) = 0.44
  gelu(0.70) = 0.53
  gelu(0.80) = 0.63
  gelu(0.90) = 0.73
  gelu(1.00) = 0.84
  gelu(1.10) = 0.95
  ...

Computed 10000 values of GELU(x)

If you need to clean up the build artifacts, you can run:

make clean

Hello World Tutorial: A GELU Kernel

As a real-world example of how to use gpu.cpp, let's start with a practical but simple GPU kernel from neural networks.

GELU is a non-linear, embarrassingly parallel operation often used in modern large language model transformer-based architectures.

It takes as input a vector of floats and applies the GELU function to each element of the vector. The function is nonlinear, attenuating values below zero to near zero, approximating the y = x identity function for large positive values. For values close to zero, GELU smoothly interpolates between the identity function and the zero function.
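
Concretely, the kernel below uses the common tanh-based approximation (the same one used in llm.c's CUDA kernel):

  GELU(x) ≈ 0.5 * x * (1 + tanh(sqrt(2/π) * (x + 0.044715 * x³)))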

The GELU code below will illustrate the three main aspects of setting up a GPU computation with gpu.cpp:

  1. The code that runs on the GPU (in WebGPU Shading Language, or WGSL), implementing the compute operation.

  2. The code that runs on the CPU (in C++) that sets up the GPU computation by allocating and preparing resources. For high performance, this code should be run ahead of time, outside the hot paths of the application.

  3. The code that runs on the CPU (in C++) that dispatches the GPU computation and retrieves the results. The key concern of hot-path dispatch code is to eliminate or minimize any unnecessary resource allocation or data movement (offloading such concerns to step 2). A secondary consideration is that GPU dispatches are asynchronous. We work with standard C++ asynchronous primitives to manage the asynchronous aspect of kernel dispatch.

Here's a GELU kernel (based on the CUDA implementation in llm.c), implemented as on-device WGSL code and invoked from the host using gpu.cpp library functions and types. It can be compiled using a standard C++ compiler (we recommend Clang):

#include <array>
#include <cstdio>
#include <future>

#include "gpu.h"

using namespace gpu; // createContext, createTensor, createKernel,
                     // dispatchKernel, wait, toCPU, Bindings,
                     // Tensor, Kernel, Context, Shape, kf32

static const char *kGelu = R"(
const GELU_SCALING_FACTOR: f32 = 0.7978845608028654; // sqrt(2.0 / PI)
@group(0) @binding(0) var<storage, read_write> inp: array<{{precision}}>;
@group(0) @binding(1) var<storage, read_write> out: array<{{precision}}>;
@compute @workgroup_size({{workgroupSize}})
fn main(
    @builtin(global_invocation_id) GlobalInvocationID: vec3<u32>) {
    let i: u32 = GlobalInvocationID.x;
    if (i < arrayLength(&inp)) {
        let x: f32 = inp[i];
        out[i] = select(0.5 * x * (1.0 + tanh(GELU_SCALING_FACTOR
                 * (x + .044715 * x * x * x))), x, x > 10.0);
    }
}
)";

int main(int argc, char **argv) {
  Context ctx = createContext();
  static constexpr size_t N = 10000;
  std::array<float, N> inputArr, outputArr;
  for (int i = 0; i < N; ++i) {
    inputArr[i] = static_cast<float>(i) / 10.0; // dummy input data
  }
  Tensor input = createTensor(ctx, Shape{N}, kf32, inputArr.data());
  Tensor output = createTensor(ctx, Shape{N}, kf32);
  std::promise<void> promise;
  std::future<void> future = promise.get_future();
  Kernel op = createKernel(ctx, {kGelu, /* 1-D workgroup size */ 256, kf32},
                           Bindings{input, output},
                           /* number of workgroups */ {cdiv(N, 256), 1, 1});
  dispatchKernel(ctx, op, promise);
  wait(ctx, future);
  toCPU(ctx, output, outputArr.data(), sizeof(outputArr));
  for (int i = 0; i < 16; ++i) {
    printf("  gelu(%.2f) = %.2f\n", inputArr[i], outputArr[i]);
  }
  return 0;
}

Here we see the GPU code is quoted in a domain specific language called WGSL (WebGPU Shading Language). In a larger project, you might store this code in a separate file to be loaded at runtime (see examples/shadertui for a demonstration of live WGSL code re-loading).

The CPU code in main() sets up the host coordination for the GPU computation. We can think of the gpu.cpp library as a collection of GPU nouns and verbs.

The "nouns" are GPU resources modeled by the type definitions of the library and the "verbs" actions on GPU resources, modeled by the functions of the library. The ahead-of-time resource acquisition functions are prefaced with create*, such as:

  • createContext() - constructs a reference to the GPU device context (Context).
  • createTensor() - acquires a contiguous buffer on the GPU (Tensor).
  • createKernel() - constructs a handle to resources for the GPU computation (Kernel), taking the shader code as input and the tensor resources to bind.

These resource acquisition functions are tied to resource types for interacting with the GPU:

  • Context - a handle to the state of resources for interacting with the GPU device.
  • Tensor - a buffer of data on the GPU.
  • KernelCode - the code for a WGSL program that can be dispatched to the GPU. This is a thin wrapper around a WGSL string and also includes the workgroup size the code is designed to run with.
  • Kernel - a GPU program that can be dispatched to the GPU. This accepts a KernelCode and a Bindings argument - a list of Tensor instances that should map to the bindings declared at the top of the WGSL code. In this example there are two bindings, corresponding to the input and output buffers on the GPU.

In this example, the GELU computation is performed only once and the program immediately exits, so preparing resources and dispatching sit side by side. Other examples in the examples/ directory illustrate how resource acquisition is prepared ahead of time and dispatch occurs in the hot path, like a render, model inference, or simulation loop.

Besides the create* resource acquisition functions, there are a few more "verbs" in the gpu.cpp library for handling dispatching execution to the GPU and data movement:

  • dispatchKernel() - dispatches a Kernel to the GPU for computation. This is an asynchronous operation that returns immediately.
  • wait() - blocks until the GPU computation is complete. This is a standard C++ future/promise pattern.
  • toCPU() - moves data from the GPU to the CPU. This is a synchronous operation that blocks until the data is copied.
  • toGPU() - moves data from the CPU to the GPU. This is a synchronous operation that blocks until the data is copied. In this particular example, toGPU() is not used because there's only one data movement from CPU to GPU in the program and that happens when the createTensor() function is called.

This example is available in examples/hello_world/run.cpp.
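
For programs that dispatch repeatedly, the same pieces simply move around: the create* calls go into a setup phase, and only data movement, dispatch, and synchronization run in the loop. Below is a minimal sketch of that structure, reusing the kGelu shader and N from the example above. It assumes a toGPU(ctx, hostPointer, tensor) overload for uploading fresh input each iteration and that the Kernel can be re-dispatched directly; check the current gpu.h and the examples/ directory for the exact signatures and re-dispatch pattern.

int main() {
  Context ctx = createContext();
  std::array<float, N> inputArr, outputArr;
  // Ahead of time: acquire all GPU resources once, outside the hot path.
  Tensor input = createTensor(ctx, Shape{N}, kf32);
  Tensor output = createTensor(ctx, Shape{N}, kf32);
  Kernel op = createKernel(ctx, {kGelu, 256, kf32},
                           Bindings{input, output},
                           {cdiv(N, 256), 1, 1});
  // Hot path: no resource acquisition, only data movement and dispatch.
  for (int step = 0; step < 100; ++step) {
    // ... update inputArr for this step ...
    toGPU(ctx, inputArr.data(), input); // assumed host-to-GPU overload
    std::promise<void> promise;
    std::future<void> future = promise.get_future();
    dispatchKernel(ctx, op, promise);
    wait(ctx, future);
    toCPU(ctx, output, outputArr.data(), sizeof(outputArr));
  }
  return 0;
}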

Other Examples: Matrix Multiplication, Physics Sim, and SDF Rendering

You can explore the example projects in examples/ which illustrate how to use gpu.cpp as a library.

After you have run make in the top-level directory (which retrieves the prebuilt Dawn shared library), you can run each example by navigating to its directory and running make there.

An example of tiled matrix multiplication is in examples/matmul. This implements a WebGPU version of the first few kernels of Simon Boehm's How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog post. It currently runs at ~ 2.5+ TFLOPs on a Macbook Pro M1 Max laptop, which has a theoretical peak of 10.4 TFLOPs. Contributions to optimize this further are welcome.

A physics simulation of an ensemble of double pendulums, simulated in parallel with different initial conditions on the GPU, is shown in examples/physics.

(matmul example output; physics example animation)

We also show some examples of signed distance function computations, rendered in the terminal as ASCII. A 3D SDF of spheres is shown in examples/render and a Shadertoy-like live-reloading example is in examples/shadertui.

Interestingly, given a starting example, LLMs such as Claude 3.5 Sonnet can be quite capable at writing low-level WGSL code for you - the other shaders in the shadertui example were written by the LLM.

(shadertui example animation)

Who is gpu.cpp for?

gpu.cpp is aimed at enabling projects requiring portable on-device GPU computation with minimal implementation complexity and friction. Some example use cases are:

  • Development of GPU algorithms to be run on personal computing devices
  • Direct standalone implementations of neural network models
  • Physics simulations and simulation environments
  • Multimodal applications - audio and video processing
  • Offline graphics rendering
  • ML inference engines and runtimes
  • Parallel compute intensive data processing applications

Although gpu.cpp is meant for any general-purpose GPU computation and not strictly AI, one area we're interested in is pushing the limits of what's possible at the intersection of new post-training algorithms and on-device compute.

To date, AI research has primarily been built with CUDA as the privileged first-class target. CUDA has been dominant at large-scale training and inference, but at the other end of the spectrum, in the world of GPU compute on personal devices, there exists far more heterogeneity in the hardware and software stack.

GPU compute in this personal device ecosystem has been largely limited to a small group of experts such as game engine developers and engineers working directly on ML compilers or inference runtimes. Along with that, implementing directly against the Vulkan or even WebGPU API tends to be targeted mostly at infrastructure-scale efforts - game engines, production ML inference engines, large software packages.

We want to make it easier for a broader range of projects to harness the power of GPUs on personal devices. With a small amount of code, we can access the GPU at a low-level, focusing on directly implementing algorithms rather than the scaffolding and tech stack around the GPU. For example, in our AI research there's much to explore with the various forms of dynamic/conditional post-training computation - dynamic use of adapters, sparsity, model compression, realtime multimodal integrations etc.

gpu.cpp lets us implement and drop in any algorithm with fine-grained control of data movement and GPU code, and explore outside the boundaries of what is supported by existing production-oriented inference runtimes. At the same time, we can write code that is portable and immediately usable on a wide variety of GPU vendors and compute form factors - workstations, laptops, mobile, or even emerging hardware platforms such as AR/VR and robotics.

What gpu.cpp is not

gpu.cpp is meant for developers with some familiarity with C++ and GPU programming. It is not a high-level numerical computing or machine learning framework or inference engine, though it can be used in support of such implementations.

Second, in spite of the name, WebGPU has native implementations decoupled from the web and the browser. gpu.cpp leverages WebGPU as a portable native GPU API first and foremost, with the possibility of running in the browser being a convenient additional benefit in the future.

If you find it counterintuitive, as many do, that WebGPU is a native technology and not just for the web, watch Elie Michel's excellent talk "WebGPU is Not Just About the Web".

Finally, the focus of gpu.cpp is general-purpose GPU computation rather than rendering/graphics on the GPU, although it can be useful for offline rendering or video processing use cases. We may explore directions with graphics in the future, but for now our focus is GPU compute.

Limitations and Upcoming Features

API Improvements - gpu.cpp is a work-in-progress and there are many features and improvements to come. At this early stage, we expect the API design to evolve as we identify improvements / needs from use cases. In particular, the handling of structured parameters and asynchronous dispatch will undergo refinement and maturation in the short-term.

Browser Targets - In spite of using WebGPU, we haven't tested builds targeting the browser yet, though this is a short-term priority.

Reusable Kernel Library - Currently the core library is strictly the operations and types for interfacing with the WebGPU API, with some specific use case example WGSL implementations in examples/. Over time, as kernel implementations mature we may migrate some of the reusable operations from specific examples into a small reusable kernel library.

More Use Case Examples and Tests - Expect an iteration loop from use cases to design tweaks and improvements, which in turn make the use cases cleaner and easier to write. One short-term use case is to flesh out the kernels from llm.c in WebGPU form. As these mature into a reusable kernel library, we hope to help realize the potential for WebGPU compute in AI.

Troubleshooting

If you run into issues building the project, please open an issue.

Acknowledgements

gpu.cpp makes use of Google's Dawn native WebGPU implementation.

Thanks also to colleagues on the Answer.AI team for their support, testing help, and feedback.

Discord Community and Contributing

Join our community in the #gpu-cpp channel on the AnswerDotAI Discord with this invite link. Feel free to get in touch via X @austinvhuang as well.

Feedback, issues and pull requests are welcome.

Code Guidelines for Contributors

For contributors, here are general rules of thumb regarding the design and style of the gpu.cpp library:

Aesthetics - Maximize Leverage and Account for Sources of Friction:

  • In addition to performance, time-to-grok the codebase, compilation time, and the number of failure modes for builds are all worth optimizing for.
  • Increase the implementation surface area only when there's a clear goal behind doing so. This maximizes leverage per unit effort, increases optionality in how the library can be used, and keeps compile times low.
  • Taking inspiration from the time-tested horizontal extensibility of neural network libraries like PyTorch, to a first approximation the library architecture could be described as a bag of composable functions.
  • Design choices generally attempt to blend the composability of functional programming with the performance awareness of data-oriented design.

Overloads and Templates:

  • Prefer value-level types over type-level templates, especially for core implementation code. It's easy to add a more type-safe templated wrapper around a value-typed core implementation, whereas moving a templated core implementation from compile time to runtime is a more significant refactor.
  • For compile-time polymorphism, prefer trivial function overloads over templates. Besides compile-time benefits, reasoning about which version of a function is being called becomes explicit and scannable in the codebase (see the sketch below).
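
A purely illustrative sketch of this preference (scale is a made-up function, not part of the library):

// Preferred: explicit overloads, one per supported value type. Which
// version is called is visible at a glance at each call site.
inline float scale(float x, float factor) { return x * factor; }
inline double scale(double x, double factor) { return x * factor; }

// Avoided for core implementation code: a template obscures which
// instantiation is used and adds compile-time cost.
// template <typename T> T scale(T x, T factor) { return x * factor; }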

Avoid Encapsulation and Methods:

  • To build systems effectively, we need to construct them out of subsystems whose behavior is known and thereby composable and predictable. Therefore, we prefer transparency and avoid encapsulation. Don't use abstract classes as interface specifications; the library and its function signatures are the interface.
  • Use struct as a default over class unless there's a clear reason otherwise.
  • Instead of methods, pass the "owning" object as a reference to a function (see the sketch after this list). In general this convention can perform any operation that a method can, but with more flexibility and less coupling. Using mutating functions generalizes more cleanly to operations that have side effects on more than one parameter, whereas methods privilege the owning class, treating the single-variable case as a special case and making it harder to generalize to multiple parameters.
  • Methods are usually reserved for privileged constructor/destructor/operator cases.
  • For operations requesting GPU resources and more complex initialization, use factory functions following the create[X] convention - createTensor, createKernel, createContext etc.
  • Use (as-trivial-as-possible) constructors for simple supporting types that mostly provide metadata for a dispatch: Shape, KernelCode, etc.
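
A hypothetical sketch of these conventions (the Counter type and its functions are illustrative only, not part of the library):

// Plain struct as the default: transparent state, no encapsulation.
struct Counter {
  int value = 0;
};

// Instead of a method, the "owning" struct is passed by reference.
// This generalizes cleanly when an operation mutates several parameters.
void increment(Counter &c, int by = 1) { c.value += by; }

// More involved initialization goes through a create* factory function.
Counter createCounter(int start) { return Counter{start}; }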

Ownership:

  • Prefer stack allocation for ownership; use unique_ptr when heap allocation is needed. Use raw pointers only for non-owning views (see the sketch below). Avoid shared_ptr unless there's a clear rationale for shared ownership.
  • Use pools as a single point of control to manage sets of resources. Consider incorporating a pool in Context if the resource is universal enough to the overall API.
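
A hypothetical illustration of these ownership rules (the Buffer type and functions are illustrative only, not part of the library):

#include <cstddef>
#include <memory>
#include <vector>

// Owned by value / on the stack wherever possible.
struct Buffer {
  std::vector<float> data;
};

// Heap ownership, when needed, is expressed with unique_ptr.
std::unique_ptr<Buffer> makeLargeBuffer(std::size_t n) {
  auto buf = std::make_unique<Buffer>();
  buf->data.resize(n);
  return buf;
}

// Non-owning view: a raw pointer that never allocates or frees.
float sum(const Buffer *view) {
  float total = 0.0f;
  for (float x : view->data) total += x;
  return total;
}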

Separating Resource Acquisition from Hot Paths:

  • In general, resource acquisition should be done ahead of time from the hot paths of the application. This is to ensure that the hot paths are as fast as possible and don't have to deal with resource allocation or data movement.
  • Operations in the API should be implemented with a use in mind - typically either ahead-of-time resource preparation/acquisition, hot-paths, or non-critical testing/observability code.


gpu.cpp's Issues

Cannot find 'array' file of stdlib.

It seems that -stdlib=libc++ is required on ubuntu-22.04.

$ clang++ -std=c++17 -I/home/junji-hashimoto/git/gpu.cpp/examples/matmul/../.. -I/home/junji-hashimoto/git/gpu.cpp/examples/matmul/../../utils -I/home/junji-hashimoto/git/gpu.cpp/examples/matmul/../../third_party/headers -L/home/junji-hashimoto/git/gpu.cpp/examples/matmul/../../third_party/lib run.cpp -ldl -ldawn -o ./build/mm
run.cpp:1:10: fatal error: 'array' file not found
#include <array>
         ^~~~~~~
1 error generated.
$ clang++ -std=c++17 -I/home/junji-hashimoto/git/gpu.cpp/examples/matmul/../.. -I/home/junji-hashimoto/git/gpu.cpp/examples/matmul/../../utils -I/home/junji-hashimoto/git/gpu.cpp/examples/matmul/../../third_party/headers -L/home/junji-hashimoto/git/gpu.cpp/examples/matmul/../../third_party/lib run.cpp -ldl -ldawn -o ./build/mm -stdlib=libc++
$ clang++ --version
Ubuntu clang version 14.0.0-1ubuntu1.1
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.4 LTS"

web browser build example works on OSX but fails with linux x86

examples/web works on mac, but on ubuntu linux (x86 on an x1 nano), the adapter request fails (TODO(avh): check error code):

Hello gpu.cpp!
--------------

[info] Requesting adapter
[error] Error in file ../../gpu.h line 704:
Request WebGPU adapter
program exited (with status: 1), but keepRuntimeAlive() is set (counter=0) due to an async operation, so halting execution but not exiting the runtime or preventing further async execution (you can use emscripten_force_exit, if you want to force a true shutdown)

Only Get 0.8Tflops on Example gpu.cpp/examples/matmul, while >= 2.9TFlops is expected

Machine: Macbook Air M2

Theoretical GPU TFlops: 3.6TFlops

Actual TFlops:

[info] Dispatching Kernel version 7, 30 iterations ...
[info] Copying result to CPU
[info] 

Output[0] (4096, 8192)

    75.62   -41.09    44.66   -38.75 ..    77.54  -148.38   118.80    11.59
   -51.14    41.22    63.68   -85.85 ..    10.33    46.63   -37.63   -44.94
    40.48    47.92    -0.86    56.20 ..    28.35    80.12   -62.48   -70.48
    90.25  -125.13   -51.20    64.34 ..   -17.11   -20.05   -58.04    18.76
...
    40.71    83.20  -107.86   -51.57 ..    40.67   -34.96  -117.75   115.25
    -9.16   -35.50   125.30    20.48 ..   -95.57    53.38  -129.10    76.58
     6.63   -66.93    30.76   -35.62 ..    -9.20   -59.73     7.04    19.37
     9.48   -27.52    -9.45   -12.71 ..    33.74   -79.34   -68.20   -22.78


[info] 

================================================================================
Execution Time: (M = 4096, K = 4096, N = 8192) x 30 iterations :
643.3 milliseconds / dispatch ~ 887.28 GFLOPS
================================================================================

Unable to build with cmake FetchContent

Expected: Use FetchContent_Declare to declare the dependency and then use FetchContent_MakeAvailable, and the built library and headers will be in _deps/gpucpp-{build/subbuild/src}.

Actual: The generating process ends up with an error. Error message and minimal reproducible CMakeLists.txt is inside the gist

The dawn library file doesn't appear in the third_party/lib directory. After I built this and put the libdawn.dylib into the third_party/lib directory the cmake command works.

Bindings - Export Functions for FFI

With #16 we are building the recipe for our libraries but nothing is exported. The task is to handle exports within gpu.h directly for the varying platforms or to provide an additional file with the exports.

Support for 4bit or 8bit tensor

Although it may be out of scope, it would be nice to have an example of computing with 4-bit and 8-bit tensors, to save memory bandwidth.

No aarch64 build for libdawn.so

On Linux aarch64, a libdawn.so built for x86_64 is downloaded, which is not compatible.
How can libdawn.so be built for aarch64 - is there a user guide or official support?

Error building libprotobuf-mutator on macos

Compilation error occurs when building libprotobuf-mutator on macos.

$ make 
...
[ 21%] Building CXX object third_party/dawn-build/src/tint/CMakeFiles/tint_utils_rtti.dir/utils/rtti/castable.cc.o
In file included from /Users/junji.hashimoto/git/gpu.cpp/third_party/dawn-src/third_party/libprotobuf-mutator/src/src/mutator.cc:27:
/Users/junji.hashimoto/git/gpu.cpp/third_party/dawn-src/third_party/libprotobuf-mutator/src/src/field_instance.h:195:34: error: no member named 'syntax' in 'google::protobuf::FileDescriptor'
           descriptor()->file()->syntax() ==
           ~~~~~~~~~~~~~~~~~~~~  ^
/Users/junji.hashimoto/git/gpu.cpp/third_party/dawn-src/third_party/libprotobuf-mutator/src/src/field_instance.h:196:42: error: no member named 'SYNTAX_PROTO3' in 'google::protobuf::FileDescriptor'
               protobuf::FileDescriptor::SYNTAX_PROTO3;
               ~~~~~~~~~~~~~~~~~~~~~~~~~~^
/Users/junji.hashimoto/git/gpu.cpp/third_party/dawn-src/third_party/libprotobuf-mutator/src/src/mutator.cc:92:24: error: no member named 'syntax' in 'google::protobuf::FileDescriptor'
  return field.file()->syntax() == FileDescriptor::SYNTAX_PROTO3 &&
         ~~~~~~~~~~~~  ^
/Users/junji.hashimoto/git/gpu.cpp/third_party/dawn-src/third_party/libprotobuf-mutator/src/src/mutator.cc:92:52: error: no member named 'SYNTAX_PROTO3' in 'google::protobuf::FileDescriptor'
  return field.file()->syntax() == FileDescriptor::SYNTAX_PROTO3 &&
                                   ~~~~~~~~~~~~~~~~^
...
$ uname -a
Darwin G-PC-01239990 23.5.0 Darwin Kernel Version 23.5.0: Wed May  1 20:14:38 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T6020 arm64 arm Darwin
$ brew info protobuf
==> protobuf: stable 27.0 (bottled)
...

Typos in the documentation

#Typos in base dir README.md file

  • "opporation" should be "operation"
    In the sentence: "The code that runs on the GPU (in WebGPU Shading Language, or WGSL), implementing the compute opporation."

  • "priveledged" should be "privileged"
    In the sentence: "To date, AI research has primarily been built with CUDA as the priveledged first-class target."

  • "infrastrcture" should be "infrastructure"
    In the sentence: "Along with that, implementing against the Vulkan or even WebGPU API directly tends to be targeted mostly towards infrastrcture scale efforts - game engines, production ML inference engines, large software packages."

CUDA vs WGSL

CUDA's matmul is 1.5 times faster than WGSL's.
I would like to know what the performance overhead is.

WGSL

~/git/gpu.cpp/examples/matmul (main)
$ nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 3080 Laptop GPU (UUID: GPU-d69302e7-ae22-5fbc-39a3-11f66a27d0ac)
$ git rev-parse HEAD
5e030eb142fd25d92c2a64d338bd84265ba1106e
$ git diff
diff --git a/examples/matmul/run.cpp b/examples/matmul/run.cpp
index 4e61968..ae9a127 100644
--- a/examples/matmul/run.cpp
+++ b/examples/matmul/run.cpp
@@ -552,9 +552,9 @@ Kernel selectMatmul(Context &ctx, int version,
     kernel = createKernel(ctx, matmul, bindings,
                           /*nWorkgroups*/ nWorkgroups);
   } else if (version == 4 || version == 6) {
-    static constexpr size_t BM = 64;
+    static constexpr size_t BM = 128;
     static constexpr size_t BK = 16;
-    static constexpr size_t BN = 64;
+    static constexpr size_t BN = 128;
     static constexpr size_t TM = BM / BK;
     static constexpr size_t TN = BN / BK;
     Shape wgSize = {(BM / TM) * (BN / TN), 1, 1}; // This is the same as BK * BK.
$ make | grep GFLOPS
138.5 milliseconds / dispatch ~ 1984.94 GFLOPS

CUDA (An equivalent implementation of WGSL's matmul)

The CUDA matmul code is in the gist(https://gist.github.com/junjihashimoto/3a3020797076f8b5a0b4afcf0b448b93).

$ nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 3080 Laptop GPU (UUID: GPU-d69302e7-ae22-5fbc-39a3-11f66a27d0ac)
$ nvcc matmul.cu
$ ./a.out
Execution time: 355.806 ms
Execution time / iterations: 71.1612 ms
GFLOPS: 3862.75

Looking at the results below, it seems that this implementation can improve performance by 2x.

CUDA (Simon Boehm's kernels)

FYI, the performance of Simon Boehm's kernels is as follows:

(performance chart of Simon Boehm's kernels)

The GPU is "NVIDIA GeForce RTX 3080 Laptop GPU".

build refactor patches needed

Follow-up to #16. A few issues I'm running into so far:

  • default target for make should be clang/make based and not require cmake
  • cmake 3.28 is a bit high (on my system I have 3.25.1 installed) - what's the lowest we can get away with?
  • find argument usage isn't portable; on mac I get the error: find: -executable: unknown primary or operator
  • I think the dollar dereferencing $$exepath might work differently on mac; I'm getting /bin/sh: xe_path: command not found errors.
  • all examples should depend on dawnlib and run setup.py automatically if the shared library is not there
  • there seem to be cross dependencies between examples, eg run_hello_world seems to be doing a find for matmul.

Feature request: Docs

Thank you for your fantastic work!
It would help new users a lot to have docs / API documentation - for example, the correspondence between C++ data types and WebGPU tensors, which is more optimal, and other guidelines on how to write efficient code.

gcc with c++17 works fine

I tested on my linux box. It works fine.

$ gcc --version
gcc (Debian 12.2.0-14) 12.2.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
