ingonyama-zk / icicle

a GPU Library for Zero-Knowledge Acceleration

License: MIT License

Cuda 12.53% Rust 4.69% CMake 0.14% Makefile 0.05% C 1.57% Go 8.11% Shell 0.10% Dockerfile 0.01% Nix 0.01% C++ 72.28% Python 0.12% PowerShell 0.01% Sage 0.39%
cuda rust zero-knowledge gpu

icicle's Introduction

ICICLE

ICICLE is a library for ZK acceleration using CUDA-enabled GPUs.


Chat with us on Discord Follow us on Twitter GitHub Release

Background

Zero Knowledge Proofs (ZKPs) are considered one of the greatest achievements of modern cryptography. Accordingly, ZKPs are expected to disrupt a number of industries and will usher in an era of trustless and privacy-preserving services and infrastructure.

We believe GPUs are as important for ZK as for AI.

  • GPUs are a perfect match for ZK compute - around 97% of ZK protocol runtime is parallel by nature.
  • GPUs are simple for developers to use and scale compared to other hardware platforms.
  • GPUs are extremely competitive in terms of power/performance and price (roughly 3x cheaper than comparable alternatives).
  • GPUs are popular and readily available.

Getting Started

ICICLE is a CUDA implementation of general functions widely used in ZKP.

Note

Developers: We highly recommend reading our documentation

Tip

Try out ICICLE by running some of our examples in C++ or via our Rust bindings

Prerequisites

  • CUDA Toolkit version 12.0 or newer.
  • CMake version 3.18 or newer; the latest version is recommended.
  • GCC version 9 or newer; the latest version is recommended.
  • Any NVIDIA GPU that supports CUDA Toolkit version 12.0 or above.

Note

It is possible to use CUDA 11 for cards which don't support CUDA 12; however, we don't officially support that version, and issues may arise in the future.

Accessing Hardware

If you don't have access to an NVIDIA GPU, we have some options for you.

Check out Google Colab. Google Colab offers a free T4 GPU instance that ICICLE can be used with; reference this guide for setting up your Google Colab workspace.

If you require more compute and have an interesting research project, we have bounty and grant programs.

Build systems

ICICLE has three build systems.

ICICLE core always needs to be built as part of the other build systems, as it contains the core ICICLE primitives implemented in CUDA. Reference these guides for the different build systems: the ICICLE core guide, the ICICLE Rust guide, and the ICICLE Golang guide.

Compiling ICICLE

Running ICICLE via the Rust bindings is highly recommended and simple:

  • Clone this repo
    • Go to our Rust bindings
    • Enter a curve implementation
    • Run cargo build --release to build, or cargo test to build and run the tests

If you want to compile and run the core ICICLE C++ tests, follow these steps:

  • Clone this repo
    • Go to ICICLE core
    • Execute the small script to compile via CMake and run the C++ and CUDA tests

Docker

We offer a Docker container so you can run ICICLE without setting everything up locally.

docker build -t <name_of_your_choice> .
docker run --gpus all -it <name_of_your_choice> /bin/bash

Contributions

Join our Discord server and find us in the icicle channel. We will be happy to work together to support your use case and discuss features, bugs, and design.

Development Contributions

If you are changing code, please make sure to change your git hooks path to the repo's hooks directory by running the following command:

git config core.hooksPath ./scripts/hooks

In case clang-format is missing on your system, you can install it using the following command:

sudo apt install clang-format

You will also need to install codespell to check for typos.

This will ensure our custom hooks are run and will make it easier to follow our coding guidelines.

Hall of Fame

  • Robik, for his ongoing support and mentorship
  • liuxiao, for being a top notch bug smasher
  • gkigiermo, for making it intuitive to use ICICLE in Google Colab
  • nonam3e, for adding Grumpkin curve support into ICICLE
  • alxiong, for adding warmup for CudaStream
  • cyl19970726, for updating go install source in Dockerfile
  • PatStiles, for adding Stark252 field

Help & Support

For help and support, talk to our devs in our Discord channel "ICICLE".

License

ICICLE is distributed under the terms of the MIT License.

See LICENSE-MIT for details.


icicle's Issues

[FEAT]: Add benchmarks for ICICLE primitives

Description

Add benchmarks for:

  • MSM
  • NTT
  • ECNTT
  • Scalar Vector Multiplication
  • Point Vector Multiplication

Motivation

Benchmarks allow measuring performance and comparing against CPU implementations

[BUG]: Linker issue in fp_config with Rust executable

Description

Linker issue in the Rust build with fp_config fields that are arrays of arrays of limbs

~/Projects/ingo-cuda-1_main/target/release/build/icicle-utils-6b3c082041b18c7f/out/libingo_icicle.a(ntt.o): in function `Projective<Field<fq_config>, 4u>* ntt_template<Projective<Field<fq_config>, 4u>, Field<fp_config> >(Projective<Field<fq_config>, 4u>*, unsigned int, Field<fp_config>*, unsigned int, bool)':
          tmpxft_0000fcfc_00000000-7_ntt.cudafe1.cpp:(.text._Z12ntt_templateI10ProjectiveI5FieldI9fq_configELj4EES1_I9fp_configEEPT_S8_jPT0_jb[_Z12ntt_templateI10ProjectiveI5FieldI9fq_configELj4EES1_I9fp_configEEPT_S8_jPT0_jb]+0x495): undefined reference to `fp_config::inv'
          /usr/bin/ld: ~/Projects/ingo-cuda-1_main/target/release/build/icicle-utils-6b3c082041b18c7f/out/libingo_icicle.a(ntt.o): in function `ecntt_end2end':
          tmpxft_0000fcfc_00000000-7_ntt.cudafe1.cpp:(.text.ecntt_end2end+0x68): undefined reference to `fp_config::omega_inv'
          /usr/bin/ld: tmpxft_0000fcfc_00000000-7_ntt.cudafe1.cpp:(.text.ecntt_end2end+0x173): undefined reference to `fp_config::omega

Reproduce

cargo test --release on #20 without this hotfix

Expected Behavior

linking without issues

Environment

OS + Version: Ubuntu 20.04

Cargo Version: cargo 1.67.0 (8ecd4f20a 2023-01-10)

GPU type: Laptop 3050Ti (NVIDIA_DEV.2583, GN20-P1)

Additional context

As discussed with the team, a cleaner version of this hotfix is needed

[FEAT]: Add support for Pasta curves

Description

Add support for Pasta curves (Pallas and Vesta)

Motivation

Pasta curves are great for recursive SNARKs and are actively used in modern proving systems (e.g. Nova)

MSM sum reduction optimization

The sum_reduction kernel for the short MSM implements sequential addressing for contiguous memory access to avoid shared-memory bank conflicts. This performs better than naive warp divergence (generally ~2x faster), but there are other optimizations that can be made and applied to the final_accumulation_kernel for the large MSM as well.

For example, in the sum reduction kernel, half the threads are idle after the first pass of the loop. Optimizations 4 and 5 in the guide linked below address this by launching half the number of threads and packing more work into each thread, along with unrolling the loop for the last warp.

https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf

Here's some example code below: https://github.com/TalDerei/cuda-barretenberg/blob/msm/src/aztec/gpu/benchmark/tests/msm/example_test.cu

[BUG]: Resolve Field omega case

Description

Field::omega uses a static switch-case as a hotfix for a previous issue:

static constexpr HOST_INLINE Field omega(uint32_t log_size) {

This fix hardcodes a fixed set of switch cases. Solve it by removing the cases and instead reading the value from an array of omegas in the relevant params.cu for each curve.
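A minimal sketch of such a table-driven lookup, using a toy field (integers mod 17, where 3 generates the order-16 multiplicative group) instead of a real curve's parameters:

```cpp
#include <array>
#include <cstdint>

// Toy field: integers mod 17. omegas[k] is a primitive 2^k-th root of
// unity; in ICICLE this table would live in params.cu per curve.
constexpr uint32_t P = 17;

struct Field {
    uint32_t x;
    static constexpr std::array<uint32_t, 5> omegas = {1, 16, 4, 9, 3};
    // Table lookup replaces the hardcoded switch-case.
    static constexpr Field omega(uint32_t log_size) { return Field{omegas[log_size]}; }
};

// Helper to check root orders: computes b^e mod P.
constexpr uint32_t modpow(uint32_t b, uint32_t e) {
    uint32_t r = 1;
    while (e--) r = r * b % P;
    return r;
}
```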

[FEAT]: Poseidon hash implementation

Description

Implement the Poseidon hash function in CUDA

Motivation

Poseidon is a widely used primitive required by many (if not most) zero-knowledge systems

[FEAT]: NTT batch

Description

A batched NTT function

Motivation

A faster option for computing many NTTs

[FEAT]: Implement binary shifts for scalars

Description

Implement binary shift methods for scalars here

Motivation

Some protocols require these operations over scalars, e.g. the computation of domain tags for the Poseidon hash function

[FEAT]: Resolve compiler warnings

Description

Cleanup warnings

Motivation

Currently, some compiler warnings are present. They can point to real errors, and they also obscure useful output.

double-and-add algorithm

The addition operation in projective coordinates implements operator overloading that computes a scalar multiplication using a naive double-and-add algorithm. It would be worth specializing the operation by switching to Montgomery multiplication (CIOS) for more efficient multiplications.

Here's an efficient implementation: https://github.com/MinaProtocol/gpu-groth16-prover-3x/blob/master/multiexp/arith.cu#L289. It uses cooperative groups and thread ranks in cuda. For your codebase, it would simply require changing out the 'fixnum' calls to equivalent PTX instructions you implement in 'ptx.cuh'.
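For reference, the naive double-and-add in question looks like this in outline. The sketch below runs on a toy additive group (integers mod a prime) rather than a curve, so the doubling and addition stand in for the projective point operations:

```cpp
#include <cstdint>

// Toy stand-in for the group: integers mod P under addition. On the curve,
// `acc + acc` would be a point doubling and `acc + point` a point addition.
constexpr uint64_t P = 1000003;

uint64_t double_and_add(uint64_t point, uint64_t scalar) {
    uint64_t acc = 0;
    for (int bit = 63; bit >= 0; --bit) {
        acc = (acc + acc) % P;                            // double
        if ((scalar >> bit) & 1) acc = (acc + point) % P; // conditionally add
    }
    return acc; // scalar * point in the toy group
}
```

The cost is one doubling per scalar bit plus one addition per set bit, which is why specialized multiplication routines pay off.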

[FEAT]: NUM_THREADS causes slow performance?

I'm curious about the change of NUM_THREADS.
It seems to cause slow performance.

//split scalars into digits
NUM_THREADS = 1 << 10;
NUM_BLOCKS = (size * (nof_bms+1) + NUM_THREADS - 1) / NUM_THREADS;
split_scalars_kernel<<<NUM_BLOCKS, NUM_THREADS>>>(bucket_indices + size, point_indices + size, scalars, size, msm_log_size, nof_bms, bm_bitsize, c); //+size - leaving the first bm free for the out of place sort later

//launch the accumulation kernel with maximum threads
NUM_THREADS = 1 << 8;
NUM_BLOCKS = (nof_buckets + NUM_THREADS - 1) / NUM_THREADS;
accumulate_buckets_kernel<<<NUM_BLOCKS, NUM_THREADS>>>(buckets, bucket_offsets, bucket_sizes, single_bucket_indices, point_indices, points, nof_buckets, 1, c+bm_bitsize);

Pinned host memory

Host buffers are currently pageable memory. Switching to pinned host memory using cudaMallocHost will perform better, with higher host-device transfer bandwidth.

update readme

Update the README to reflect that the "icicle" directory no longer needs "lib.cu" after the build refactoring.

[FEAT]: Cuda graphs

Description

Add Cuda graph support

Motivation

GPU utilization and performance improvement

[BUG]: Detect GPU arch at config time, or please mention it in the README

Description

Was building and running the CUDA test suite; every test but one was failing with error 500 (named symbol not found). It took me a minute to figure out.

Reproduce

Build and run the tests on hardware with compute capability below 80 (sm_80)

Expected Behavior

For the tests to run and complete calls to CUDA kernels

Environment

WSL2, gcc 11.3, CUDA 12.1, CMake 3.22.1

OS + Version:

Ubuntu 22.04.1 LTS

Cargo Version:

Not installed

GPU type:

RTX 2060

Additional context

Well, at least with CMake 3.24 and above, setting CMAKE_CUDA_ARCHITECTURES to "native" is available to target the build machine's GPU automatically.

[FEAT]: Build scheme

Description

Build scheme

Motivation

Update the build scheme for upcoming changes required for updated Danksharding

[FEAT]: BN-254 curve support

Description

Please provide a clear and concise description of the feature you would like included.

Motivation

Please provide a clear and concise description of the motivation for adding this feature.

[FEAT]: thread coarsening for larger msm sizes

Description

The current msm.cu implementation launches a monolithic kernel in several places like this:

unsigned NUM_THREADS = 1 << 10;
unsigned NUM_BLOCKS = (total_nof_buckets + NUM_THREADS - 1) / NUM_THREADS;
initialize_buckets_kernel<<<NUM_BLOCKS, NUM_THREADS>>>(buckets, total_nof_buckets);

This assumes the GPU has enough threads to process every element. Grid-stride for loops allow processing more data than there are GPU threads by making each thread do more work, and they potentially allow maximum memory coalescing on larger inputs, since consecutive memory is accessed continuously. I think this should be one of the easier fixes to efficiently enable larger MSM sizes.

Working on a pull request but getting comfortable with the msm Rust binding for testing first.

Motivation

From the icicle Discord I see "msm for large sizes" is a sprint priority.

MSM bucket segmentation

An improvement from the PipeMSM paper: implement a segmented version of the buckets described in Algorithm 2. This is related to the big_triangle_sum_kernel kernel.

Coordinate conversions

The codebase implements conversion between affine and projective coordinates. I think it would be helpful to add helper methods for conversions between Jacobian and projective coordinates as well.
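The conversion formulas themselves are simple; this sketch uses doubles instead of field elements to stay self-contained. Projective (X:Y:Z) represents the affine point (X/Z, Y/Z), while Jacobian (X:Y:Z) represents (X/Z^2, Y/Z^3), so:

```cpp
// Toy coordinates over doubles; in the library these would be field elements.
struct Pt { double x, y, z; };

// Projective (X:Y:Z) -> Jacobian (X*Z : Y*Z^2 : Z): preserves the affine
// point, since (X*Z)/Z^2 = X/Z and (Y*Z^2)/Z^3 = Y/Z.
Pt projective_to_jacobian(Pt p) { return {p.x * p.z, p.y * p.z * p.z, p.z}; }

// Jacobian (X:Y:Z) -> Projective (X*Z : Y : Z^3): (X*Z)/Z^3 = X/Z^2, and
// Y/Z^3 matches the Jacobian y.
Pt jacobian_to_projective(Pt p) { return {p.x * p.z, p.y, p.z * p.z * p.z}; }
```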

[FEAT]: KZG openings

Description

Currently we don't have the functionality to open committed polynomials at arbitrary points.

Motivation

In Danksharding, KZG openings are performed at specific carefully chosen points, which makes these openings easier than in the general case. But for many protocols, like general-purpose ZK-SNARKs, KZG openings at arbitrary points are required, and we'll need to support them in the future.

Note: there are two versions of openings. If the polynomial is given by its coefficients, the opening is done by Horner's rule and committing to the quotient; if the polynomial is in evaluation form, dividing by the evaluations of the monomial and computing an MSM afterwards is required.

MSM benchmarking inquiry

Can someone benchmark the MSM on 2^26 scalars and points, and share their results? I'm getting some mixed numbers.

Have you tested how the bucket method implementation compares against the ZPrize reference implementation in terms of execution time / memory utilization?

@HadarIngonyama
@LeonHibnik
@vhnatyk

bucket accumulation clarification

In the accumulate_buckets_kernel call, why do we launch this kernel with maximum of 2^8 threads? If you have "nof_bms << c" buckets, then you wouldn't have enough threads to allocate one thread per bucket. How does this not result in some buckets not being used?

I'm assuming 2^8 is because of register limits, but I'd think we'd need to launch extra blocks to compensate.

Edit: I was calculating the block formula with 2^10 threads instead of 2^8. The code is correct.

[FEAT]: Cuda streams

Description

Implement CUDA streams

Motivation

Better GPU utilization and performance

Linking error?

I'm getting a linking error when trying to run "cargo test", resulting in "could not compile icicle-utils due to previous error; 15 warnings emitted" and "error: linking with cc failed: exit status: 1". Building the ICICLE library using nvcc and then running "cargo build --release" works, but the tests fail for some reason.

In the error logs, it mentions "= note: some extern functions couldn't be found; some native libraries may need to be installed or have their path specified". Running only the MSM tests, for example, produces the same errors.

I've built the codebase by setting -arch=native (to autodetect the GPU), and arch=compute_80 (for the Ampere A10). Both builds produce the same error. I'm running on Linux Ubuntu 22.04 distribution.

Any insight into what's going on would be appreciated!

[BUG]: Zero point comparison

Description

Point equality comparison can be misleading when zero points are involved

Reproduce

In Rust, for example, this assertion fails:

assert_ne!(Point::zero(), Point::one());

Expected Behavior

Aligned with Ark, the assertion above should pass

[FEAT]: Cuda functions on device only

Description

Add a (user-selectable) feature allowing CUDA functions to take inputs already resident on the GPU and to keep results on the GPU

Motivation

Improve performance when running a series of GPU functions by avoiding data transfers to and from the host CPU

Window size

Curious what the rationale for the window size of c = 10 bits (26 windows) is? It feels like this choice could be optimized?
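For context, the window count follows from ceiling division of the scalar bit length by the window size c; assuming 254-bit scalars (as in BN254), c = 10 gives 26 windows:

```cpp
// Number of c-bit windows (bucket modules) needed to cover a scalar.
constexpr unsigned windows(unsigned scalar_bits, unsigned c) {
    return (scalar_bits + c - 1) / c; // ceiling division
}
```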

Compilation time

Is there any way to speed up the slow Rust compilation time after making a change to the C++ functions? How is the dev team going about testing C++ functions via Rust FFI otherwise? Thanks!

split_scalars_kernel kernel function

The for loop in this kernel can be eliminated by integrating cooperative groups. Instead of a single thread looping over all the limbs of a scalar, multiple threads can access different limbs (or sub-parts of the same limb) of the same scalar in parallel. This would require refactoring the arithmetic to support multi-threaded field operations. It's a longer-term optimization worth looking into, and deciding whether it's right for this codebase.
