lurk-lab / neptune

Rust Poseidon implementation.

License: Other

Rust 97.63% C 1.76% Cool 0.17% Makefile 0.22% Shell 0.23%

neptune's Introduction

Neptune

About

Neptune is a Rust implementation of the Poseidon hash function tuned for Filecoin.

Neptune has been audited by ABDK Consulting and deemed fully compliant with the paper (Starkad and Poseidon: New Hash Functions for Zero Knowledge Proof Systems).

Neptune was initially specialized to the BLS12-381 curve. Although the API allows for type specialization to other fields, the round numbers, constants, and s-box selection may not be correct. As long as the alternate field is a prime field of ~256 bits, the 128-bit security Neptune targets will apply. There is a run-time assertion which will fail if constants are generated for a field whose elements do not have a representation of exactly 32 bytes. The Pasta Curves meet these criteria and are explicitly supported by Neptune.

At the time of the 1.0.0 release, Neptune on RTX 2080Ti GPU can build 8-ary Merkle trees for 4GiB of input in 16 seconds.
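To give a sense of the workload behind that figure, here is a minimal sketch that counts the hash invocations in a complete 8-ary tree. It assumes 32-byte leaves (consistent with the 32-byte field element representation mentioned above) and an input that is an exact power of the arity; the function names are hypothetical and not part of Neptune's API.

```rust
/// Number of leaves when `bytes` of input are split into 32-byte elements.
/// (Assumption: one leaf per 32-byte field element.)
fn leaf_count(bytes: u64) -> u64 {
    bytes / 32
}

/// Number of hash invocations needed to reduce `leaves` to a single root in
/// a complete `arity`-ary Merkle tree. Returns `None` when `leaves` is not an
/// exact power of `arity`, because this sketch does not model padding.
fn merkle_hash_count(leaves: u64, arity: u64) -> Option<u64> {
    if arity < 2 || leaves == 0 {
        return None;
    }
    let mut nodes = leaves;
    let mut hashes = 0u64;
    while nodes > 1 {
        if nodes % arity != 0 {
            return None;
        }
        // Each group of `arity` children is hashed into one parent.
        nodes /= arity;
        hashes += nodes;
    }
    Some(hashes)
}
```

Under these assumptions, 4 GiB of input gives 134217728 leaves, and building the 8-ary tree takes 19173961 hash invocations.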

Implementation Specification

Filecoin's Poseidon specification is published in the Filecoin specification document here. Additionally, Markdown and PDF versions are mirrored in this repo in the spec directory.

Contributing to the Spec

PDF Rendering Instructions

The spec's PDF is rendered using Typora. Download the spec's Markdown file here, open the file in Typora, make and save your changes, then export the file as a PDF.

Ensuring Spec Documents Stay in Sync

When making changes to the spec documents in neptune, make sure that the spec's PDF file poseidon_spec.pdf is the PDF rendering of the Markdown spec poseidon_spec.md.

If you make changes to the spec in neptune, you must make those same changes to the Filecoin spec here, thus ensuring that all three documents (one Markdown+LaTeX and one PDF in neptune, and one Markdown+MathJax in filecoin-project/specs) stay in sync.

Environment variables

EC_GPU_FRAMEWORK=<cuda | opencl> selects whether the CUDA or OpenCL implementation is used. If not set, cuda will be used if available.
EC_GPU_CUDA_NVCC_ARGS

By default the CUDA kernel is compiled for several architectures, which may take a long time. EC_GPU_CUDA_NVCC_ARGS can be used to override those arguments. The input and output file will still be automatically set.

// Example for compiling the kernel for only the Turing architecture
EC_GPU_CUDA_NVCC_ARGS="--fatbin --gpu-architecture=sm_75 --generate-code=arch=compute_75,code=sm_75"

Rust feature flags

Neptune also supports batch hashing and tree building, which can be performed on a GPU. GPU batch hashing is implemented in pure CUDA/OpenCL. The pure CUDA/OpenCL batch hashing is provided by the internal proteus module. To use proteus, compile neptune with the opencl and/or cuda feature.

The cuda and opencl features can be used independently or together. If both cuda and opencl are enabled, you can also select which implementation to use via the NEPTUNE_GPU_FRAMEWORK environment variable.
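The selection rule described above can be sketched as a small pure function. This is an illustration only, not Neptune's actual code: the `Framework` enum and `select_framework` are hypothetical names, the fallback to OpenCL when CUDA is unavailable is an assumption, and so is the choice to return an error when the requested framework is not compiled in.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Framework {
    Cuda,
    Opencl,
}

/// Picks a framework from the value of the environment variable (if set)
/// and from which implementations were compiled in.
fn select_framework(
    value: Option<&str>,
    cuda_available: bool,
    opencl_available: bool,
) -> Result<Framework, String> {
    match value.map(|v| v.trim().to_ascii_lowercase()).as_deref() {
        Some("cuda") if cuda_available => Ok(Framework::Cuda),
        Some("opencl") if opencl_available => Ok(Framework::Opencl),
        // Assumption: an unknown or unavailable choice is an error.
        Some(other) => Err(format!("framework `{}` is unknown or not available", other)),
        // From the text: if not set, cuda is used if available.
        None if cuda_available => Ok(Framework::Cuda),
        // Assumption: otherwise fall back to OpenCL.
        None if opencl_available => Ok(Framework::Opencl),
        None => Err("no GPU framework available".to_string()),
    }
}
```

A caller could pass `std::env::var("EC_GPU_FRAMEWORK").ok().as_deref()` as the first argument; keeping the environment lookup outside the function makes the rule easy to test.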

Arities

The CUDA/OpenCL kernel (enabled with the cuda/opencl feature) is generated with specific arities. Those arities need to be specified at compile-time via Rust feature flags. Available features are arity2, arity4, arity8, arity11, arity16, arity24, arity36. When the strengthened feature is enabled, there will be an additional strengthened version available for each arity.

When using the cuda feature, the kernel is generated at compile-time. The more arities are enabled, the longer the compile time. Hence, no arities are enabled by default; you need to enable at least one yourself.

Fields

The CUDA/OpenCL kernel (enabled with the cuda/opencl feature) is generated for specific fields. Those fields need to be specified at compile-time via Rust feature flags. Available features are bls for BLS12-381 and pasta for the Pallas and Vesta curves' scalar fields.

Running the tests

As the compile time of the kernel depends on how many arities are used, no arities are enabled by default. In order to run the tests, all arities need to be enabled explicitly. To run all tests on, e.g., the CUDA implementation, run:

cargo test --no-default-features --features cuda,bls,pasta,arity2,arity4,arity8,arity11,arity16,arity24,arity36
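The `--features` value in the command above is a framework, then the fields, then one `arityN` entry per arity. As a rough sketch of how such a list could be assembled (for example in a script that runs a subset of the tests), the hypothetical helper below builds the string; it only formats names and does not check that a feature exists.

```rust
/// Builds a comma-separated Cargo feature list from a framework feature,
/// a set of field features, and a set of arities.
fn feature_list(framework: &str, fields: &[&str], arities: &[u32]) -> String {
    let mut features: Vec<String> = vec![framework.to_string()];
    features.extend(fields.iter().map(|f| f.to_string()));
    // Arity features are named `arity<N>`, as listed under "Arities".
    features.extend(arities.iter().map(|a| format!("arity{}", a)));
    features.join(",")
}
```

Calling `feature_list("cuda", &["bls", "pasta"], &[2, 4, 8, 11, 16, 24, 36])` reproduces the feature string used in the command above.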

Benchmarking Poseidon by Field and Preimage Length

Benchmark Poseidon over the BLS12-381, Pallas, and Vesta scalar fields for preimages of length 2, 4, 8, or 11 using:

cargo bench arity-<preimage len>

Benchmark Poseidon over a specific field (bls, pallas, or vesta) and preimage length using:

cargo bench arity-<preimage len>/<field name>

Sponge API

Neptune implements the Secure Sponge API for Field Elements and serves as its reference implementation. The SpongeAPI trait defines the relevant API methods. See tests in source for simple examples of API usage with circuits and without circuits.

History

Neptune was originally bootstrapped from Dusk's reference implementation.

Changes

CHANGELOG

License

MIT or Apache 2.0

neptune's Issues

Error in the specification of the sparse_factorize function.

The documentation says that w is the first column of m_hat, or the second column of m without the first row.

Actually it is the first column of m without the first row.

The implementation is ok: https://github.com/filecoin-project/neptune/blob/bafd77a5014e3b6a6b40359097835c3eb1dd533f/src/mds.rs#L197
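The difference between the two descriptions is easy to see on a small matrix. The sketch below uses plain `u64` entries instead of field elements and hypothetical helper names; it assumes, as the issue's wording implies, that m_hat is m with its first row and first column removed, and that m is square and non-empty.

```rust
/// `w` as the issue describes it: the first column of `m` without the first row.
fn first_column_without_first_row(m: &[Vec<u64>]) -> Vec<u64> {
    m.iter().skip(1).map(|row| row[0]).collect()
}

/// `m_hat`: `m` with its first row and first column removed.
fn m_hat(m: &[Vec<u64>]) -> Vec<Vec<u64>> {
    m.iter().skip(1).map(|row| row[1..].to_vec()).collect()
}

/// First column of a matrix, used to compare against the documented wording.
fn first_column(m: &[Vec<u64>]) -> Vec<u64> {
    m.iter().map(|row| row[0]).collect()
}
```

For the matrix with rows [1, 2, 3], [4, 5, 6], [7, 8, 9], the first function returns [4, 7], while the first column of m_hat is [5, 8], so the two descriptions are not interchangeable.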

Env var flag to signal which GPU to use

I have a machine with an NVIDIA and an AMD GPU in it and I need a flag to tell Neptune which of them to use.

Potential implementation of Filecoin Poseidon in Golang?

In the new CID/CAR format, multihash is supported, and Poseidon has officially become a hash option.

However, it has not been implemented by many multihash implementations. Indeed, neither go-multihash nor rust-multihash supports Poseidon:
https://github.com/multiformats/go-multihash/tree/master/register
https://github.com/multiformats/rust-multihash#supported-hash-types

Is neptune finalized? Is there a plan to implement Poseidon in multiple programming languages (FFI does not seem to be a good idea)?

MDS matrix security

The recent update of the Poseidon article drops in additional requirements on the MDS matrix security, see p. 7. Any idea if a randomly sampled Cauchy matrix over a large field is still safe?

Benchmark tracking

Issue for tracking benchmarks over time.

dev needs to be rebased on main

dev at 576049a does not include main at d463888

dev needs to be rebased on main

dev at eccf7cb does not include main at 6cae521

Error with AMD 570 GPU

cd gbench
cargo run

Compiling gbench v0.5.4 (/home/peware/neptune/gbench)
Finished dev [unoptimized + debuginfo] target(s) in 2m 43s
Running target/debug/gbench
[2020-07-13T00:50:57Z INFO gbench] KiB: 4194304
[2020-07-13T00:50:57Z INFO gbench] leaves: 134217728
[2020-07-13T00:50:57Z INFO gbench] max column batch size: 400000
[2020-07-13T00:50:57Z INFO gbench] max tree batch size: 700000
--> Run 0
[2020-07-13T00:50:57Z INFO gbench] Creating ColumnTreeBuilder
(some Futhark code): Could not find acceptable OpenCL device.

sudo lshw -C display
*-display
description: VGA compatible controller
product: Ellesmere [Radeon RX 470/480/570/570X/580/580X]
vendor: Advanced Micro Devices, Inc. [AMD/ATI]
physical id: 0
bus info: pci@0000:05:00.0
version: ef
width: 64 bits
clock: 33MHz
capabilities: pm pciexpress msi vga_controller bus_master cap_list rom
configuration: driver=amdgpu latency=0
resources: irq:61 memory:d0000000-dfffffff memory:cfe00000-cfffffff ioport:5000(size=256) memory:fdec0000-fdefffff memory:fde00000-fde1ffff
*-display UNCLAIMED
description: VGA compatible controller
product: ES1000
vendor: Advanced Micro Devices, Inc. [AMD/ATI]
physical id: 3
bus info: pci@0000:01:03.0
version: 02
width: 32 bits
clock: 33MHz
capabilities: pm vga_controller bus_master cap_list
configuration: latency=64 mingnt=8
resources: memory:c0000000-c7ffffff ioport:2000(size=256) memory:ed9f0000-ed9fffff memory:c0000-dffff

How to use multiple GPUs? For example, my machine has two NVIDIA GPUs.

Proposal: removing Futhark implementation

Neptune currently has an implementation called Triton which is implemented in Futhark. There is now an OpenCL/CUDA implementation called Proteus, with better performance. I propose removing Triton to lower the maintenance cost of this library.

dev needs to be rebased on main

dev at ac41cb7 does not include main at 308e62c

dev needs to be rebased on main

dev at c58fb8b does not include main at 5285991

dev needs to be rebased on main

dev at 9e67e35 does not include main at cfeaa9a

Add CUDA support

Neptune should also support running on CUDA.

Feature Request: Implementation of Poseidon2 Hash Function

I would like to propose the implementation of the Poseidon2 hash function in the Neptune repository. This recent advancement enhances the efficiency of the Poseidon hash function, specifically tailored for zero-knowledge applications.

Referencing the research paper and the explanatory note provided by the authors, Poseidon2 enhances performance by focusing on its linear layers and round constant addition. This new design requires only a short chain of additions for computation, significantly reducing the number of multiplications and reductions.

Specifically,

Poseidon2 employs a fixed matrix for the external linear layers and another matrix for the internal linear layers, differing from the original Poseidon, which uses the same expensive MDS matrix in each linear layer.
Poseidon2 directly modifies the round constant addition, eliminating the need for the efficient representation required in the original Poseidon.

Given these improvements, Poseidon2 can offer a performance boost of up to a factor of 4 compared to the original Poseidon, without any increase in the number of rounds or other disadvantages. The reference implementation provided by HorizenLabs may be useful for this implementation.

Considering the focus of the Neptune repository on the Poseidon hash function, I believe that including Poseidon2 would greatly enhance its performance and efficiency.

dev needs to be rebased on main

dev at 79d1b84 does not include main at 98a7e6f

Fix description of Custom domain tag algorithm in comment.

00a87a7#r734019298

wasm target build error

Hi, could you pls suggest the correct way to build this lib for wasm. I tried
cargo build --target wasm32-unknown-unknown
and also with
cargo build --target wasm32-unknown-unknown --features "wasm"
but was getting compile errors:
MmapInner::map(self.get_len(file)?, file, self.offset).map(|inner| Mmap { inner: inner })
| ^^^^^^^^^ use of undeclared type MmapInner

Thanks.

Add a CITATION.cff to the repo

This allows Neptune to be cited properly in publications, see:
https://citation-file-format.github.io/

dev needs to be rebased on main

dev at b692e19 does not include main at 45510e4

Neptune uses an outdated reference script

Hi, fn generate_constants() https://github.com/filecoin-project/neptune/blob/2b11f0ce69f52aa9594f250baa658bfe2d349ac3/src/round_constants.rs#L26
references https://extgit.iaik.tugraz.at/krypto/hadeshash/blob/master/code/scripts/create_rcs_grain.sage
That file does not exist. An updated script exists in that repo with a notice of some fixed bugs.

Are there no security implications in not following the updated reference impl?

I was trying to reproduce the Poseidon constants which circomlib uses (they use the more recent script generate_parameters_grain.sage) and was unable to.

Upgrading ff and group crates

This issue is meant as a reminder and not as an immediate action item. There are new releases of ff and group. Upgrading to those is a breaking change, as they contain traits and you cannot really have the traits of two different versions in your dependency tree.

I propose postponing the upgrade until a new breaking release is needed for other reasons. This upgrade could then be combined with such a release.

The upgrade is not straightforward, as all other dependencies using those traits, e.g. bellperson, would also need to be updated.

The upgrade would enable an upgrade of pasta_curves as well. The most recent release v0.5.0 should contain everything our current fork contains. This means once upgraded, we won't need to rely on a fork anymore.

help..

Sorry.
I want to know how to use it.
I have 2 GPUs.
I don't understand RUST and cargo, so I want to know the execution command.

chore: some installed deps are not needed

Some dependencies specified in Cargo.toml are not needed.

Check the unused dependencies sanity check workflow for details.

This issue was raised by the workflow at https://github.com/argumentcomputer/ci-workflows/tree/main/.github/workflows/unused-deps.yml.

Note
If this is a false positive, please refer to the cargo-udeps docs on how to ignore the dependencies.

Neptune does not currently build due to dependencies

There are broken dependencies in neptune that are causing build issues in https://github.com/filecoin-project/rust-filecoin-proofs-api

From neptune:

$ cargo update
    Updating crates.io index
error: failed to select a version for the requirement `rustc_version = "^0.1"`
candidate versions found which didn't match: 0.3.3, 0.3.2, 0.3.1, ...
location searched: crates.io index
required by package `fil-ocl-core v0.11.3`
    ... which is depended on by `fil-ocl v0.19.4`
    ... which is depended on by `rust-gpu-tools v0.3.0`
    ... which is depended on by `gbench v0.5.4 (...../neptune/gbench)`

From rust-filecoin-proofs-api:

$ cargo update
    Updating crates.io index
error: failed to select a version for the requirement `rustc_version = "^0.1"`
candidate versions found which didn't match: 0.3.3, 0.3.2, 0.3.1, ...
location searched: crates.io index
required by package `fil-ocl-core v0.11.3`
    ... which is depended on by `fil-ocl v0.19.4`
    ... which is depended on by `rust-gpu-tools v0.2.0`
    ... which is depended on by `bellperson v0.12.3`
    ... which is depended on by `filecoin-proofs-api v6.0.0 (...../rust-filecoin-proofs-api)`

chore: some installed deps are not needed

Some dependencies specified in Cargo.toml are not needed.

Check the unused dependencies sanity check workflow for details.

This issue was raised by the workflow at https://github.com/lurk-lab/ci-workflows/tree/main/.github/workflows/unused-deps.yml.

Note
If this is a false positive, please refer to the cargo-udeps docs on how to ignore the dependencies.

Reporting rebasing need only once

There's a GitHub action that checks if a rebase is needed. It posts a comment every hour.

I'm currently watching this repo, and it would be great if there weren't so many events when nothing really changes. I propose using something like https://github.com/peter-evans/find-comment to check whether the comment already exists and, if it does, skip posting another one.

column_tree_builder::tests::test_column_tree_builder ... LLVM ERROR

Wasn't sure how to title this bug report, so I hope it's good enough.

My configuration:
Manjaro Linux
Mesa Drivers with AMDGPU-PRO OpenCL library (this configuration works in other OpenCL applications/benchmarks such as Luxmark)
CPU: Ryzen 5950X
GPU: Radeon RX 6800 XT

Ran into this error when running benchy from rust-fil-proofs. @vmx advised me to file a bug report against neptune.

     Finished test [unoptimized + debuginfo] target(s) in 20.54s
     Running target/debug/deps/neptune-b1005e4d0551b26d

running 33 tests
test circuit::tests::test_poseidon_hash ... ok
test circuit::tests::test_scalar_product ... ok
test circuit::tests::test_scalar_product_with_add ... ok
test circuit::tests::test_square_sum ... ok
test column_tree_builder::tests::test_column_tree_builder ... LLVM ERROR: Cannot select: 0x55bca5ac5090: i32 = GlobalAddress<[4 x i64] addrspace(5)* @constinit.10> 0
In function: hash_2_standard
error: test failed, to rerun pass '--lib'

Caused by:
  process didn't exit successfully: `/home/chuck/git/neptune/target/debug/deps/neptune-b1005e4d0551b26d --test-threads=1` (signal: 6, SIGABRT: process abort signal)

Interestingly, when I installed ROCm from AUR, I got a different, but probably related, error (technically, ROCm isn't supported on RDNA cards, but I tried it after seeing a Phoronix article):

$ cargo test --features opencl,arity2,arity4,arity8,arity11,arity16,arity24,arity36 -- --test-threads=1
    Finished test [unoptimized + debuginfo] target(s) in 0.16s
     Running target/debug/deps/neptune-b1005e4d0551b26d

running 33 tests
test circuit::tests::test_poseidon_hash ... ok
test circuit::tests::test_scalar_product ... ok
test circuit::tests::test_scalar_product_with_add ... ok
test circuit::tests::test_square_sum ... ok
test column_tree_builder::tests::test_column_tree_builder ... LLVM ERROR: Cannot select: 0x5650f3109988: i32 = GlobalAddress<[4 x i64] addrspace(5)* @constinit.10> 0
In function: apply_round_matrix_2_standard
error: test failed, to rerun pass '--lib'

Caused by:
  process didn't exit successfully: `/home/chuck/git/neptune/target/debug/deps/neptune-b1005e4d0551b26d --test-threads=1` (signal: 6, SIGABRT: process abort signal)

Purpose of (Column)TreeBuilderTrait?

We currently have two tree builders, TreeBuilder and ColumnTreeBuilder, both with their own traits. AFAICT, each of those traits has only that one implementation. Hence I wonder what the purpose of the ColumnTreeBuilderTrait and TreeBuilderTrait traits is.

I propose removing those traits and moving the implementation directly into the structs. Benefits:

Users of this library don't need to import those traits: neither tree builder makes much sense without its trait, so users usually have to import the structs as well as their traits.
No search/guessing where other implementations might be.
Easier to understand code, due to less abstractions.

Downsides:

Breaking change, users of the library would need to update their code.

chore: rust toolchain needs an upgrade

The rust version specified in rust-toolchain.toml (1.76.0) is out of date with the latest stable (1.80.1).

Check the rust version check workflow for details.

This issue was raised by the workflow at https://github.com/argumentcomputer/neptune/actions/runs/10606399386/workflow.

dev needs to be rebased on main

dev at 3248ce7 does not include main at 35f6fe3

Serialize/Deserialize Poseidon Constants

Hi - Nova's public parameters reference Neptune's Poseidon Hash Constants. We would like to serialize/deserialize Nova's public parameters using serde. What do y'all think about adding serde derive as an optional feature so that we can serialize/deserialize? It gets a bit tricky when it comes to serializing/deserializing UInt from typenum, so I was just wondering if anyone had any thoughts on that. Thanks.

dev needs to be rebased on main

dev at d6ef165 does not include main at ef14a61

dev needs to be rebased on main

dev at 89b24d5 does not include main at 3559d02

dev needs to be rebased on main

dev at f65b3e9 does not include main at cbcd0a4

dev needs to be rebased on main

dev at 5e85c74 does not include main at 70665ce

neptune for Pasta curves

The repo currently says: "Neptune is specialized to the BLS12-381 curve. Although the API allows for type specialization to other fields, the round numbers, constants, and s-box selection may not be correct. Do not do this."

Can we verify if the same set of constants would work for both curves in the Pasta cycle?

CI: push tests vs merge groups

Our current CI triggers on push to dev as well as merge groups:
https://github.com/lurk-lab/neptune/blob/9a6c931d158ebbfeb2a301f45d637642d65f0779/.github/workflows/check-downstream-compiles.yml
https://github.com/lurk-lab/neptune/blob/d8b4eeadd8acc9d9e8d9d510605c954f1410aa60/.github/workflows/rust.yml#L4-L9

the push tests are for the most part redundant,
the check-downstream-compiles job is meant only as a warning, and so is useless on merge_group or push.

(Potential) Mismatch with Poseidon Hash paper

Thanks for the wonderful work!

It seems there are a few (potential) mismatches between round_numbers.rs and the Poseidon Hash paper. Is there any reason for this mismatch? More specifically,

In Line 82, we are using let rf_interp = 0.43 * m + t.log2() - rp;. In Poseidon Hash paper, Eq (3) requires

For BLS12-381 with , M=128, and , we should add something like log_5(t) instead of t.log2().

Line 82 ~ 83 is also different from Eq (5) in Poseidon hash paper.
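To make the size of the discrepancy concrete, the sketch below evaluates the quoted expression next to the variant the issue suggests. It only illustrates the arithmetic and does not settle which form is right; the function names are hypothetical, and reading 0.43 as an approximation of log_5(2) is an inference from the issue's mention of log_5(t).

```rust
/// The expression quoted from round_numbers.rs: uses log_2(t).
fn rf_interp_log2(m: f64, t: f64, rp: f64) -> f64 {
    0.43 * m + t.log2() - rp
}

/// The variant suggested in the issue: uses log_5(t) instead.
fn rf_interp_log5(m: f64, t: f64, rp: f64) -> f64 {
    0.43 * m + t.log(5.0) - rp
}

/// Difference between the two variants; depends only on the width t.
fn interp_gap(t: f64) -> f64 {
    t.log2() - t.log(5.0)
}
```

Because log_2(t) is larger than log_5(t) for every t > 1, the quoted expression always yields the larger value of the two, and the gap grows with the width t.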

chore: rust toolchain needs an upgrade

The rust version specified in rust-toolchain.toml (1.75.0) is out of date with the latest stable (1.76.0).

Check the rust version check workflow for details.

This issue was raised by the workflow at https://github.com/lurk-lab/neptune/actions/runs/7879730700/workflow.

dev needs to be rebased on main

dev at 85f1d09 does not include main at dd09734

Allow odd full rounds

Currently, neptune expects that the number of full rounds R_F is an even number, as evidenced by the number of full rounds in the first and second halves being the same R_f = floor(R_F / 2).

All three Poseidon implementations (static, correct, and dynamic) use R_f as the number of first and second half full rounds, which is correct only when R_F is even (currently R_F = 8 for all Filecoin applications). However, when R_F is odd, the number of second half full rounds should be R_F - R_f (so you don't lose the last full round of the second half).

Required Changes

In all three Poseidon implementations, change the number of second half full rounds to self.constants.full_rounds - self.constants.half_full_rounds.
Remove this unimplemented! panic.
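The proposed change amounts to the following split of the full rounds. This is a minimal sketch with a hypothetical free function; in neptune the two values would come from `self.constants.full_rounds` and `self.constants.half_full_rounds`, as named above.

```rust
/// Splits the total number of full rounds R_F into the first-half and
/// second-half counts. The first half is floor(R_F / 2); the second half
/// takes the remainder, so an odd R_F does not lose its last round.
fn split_full_rounds(full_rounds: usize) -> (usize, usize) {
    let half_full_rounds = full_rounds / 2;
    (half_full_rounds, full_rounds - half_full_rounds)
}
```

For R_F = 8 this gives (4, 4), matching the current behaviour, and for R_F = 9 it gives (4, 5) rather than (4, 4).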

The formula is a bit wrong I think,

The formula is a bit wrong I think,
it should be identifier * 2^40 + strength * 2^32.

Originally posted by @Kubuxu in #116 (comment)
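The corrected formula from the comment above can be written out as follows. This is a sketch only: the function name is hypothetical, and using `u64` with checked arithmetic is an assumption made for illustration, since the real tag is a field element.

```rust
/// Computes identifier * 2^40 + strength * 2^32.
/// Returns `None` if the result does not fit in a `u64`.
fn domain_tag(identifier: u64, strength: u64) -> Option<u64> {
    let id_part = identifier.checked_mul(1u64 << 40)?;
    let strength_part = strength.checked_mul(1u64 << 32)?;
    id_part.checked_add(strength_part)
}
```

With identifier 1 and strength 1, the result is 2^40 + 2^32, so the two components occupy separate bit positions as long as the strength stays below 2^8.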

create_futhark_context() get 3090 GPU Error

I recently started using a 3090 to run my P2 tasks, but got this error:

Graphics SM warp Exception on GPc 1, TPc 0,
Graphics Exception Out of Range Addr

Then some of my P2 tasks fail with an invalid vanilla proof or a CID mismatch.
The tree_c and tree_r_last are the same!

I use Ubuntu 18.04 with GPU driver 460.92.

Re-establish a GPU-based CI

We introduced basic CI as part of #177 to compensate for the change in CI infrastructure. We should introduce a Docker runner able to test Neptune CI with OpenCL and CUDA capabilities, and then run it in the form of self-hosted runners.

To achieve this, we need to integrate the following resources:

Existing CI: We currently have a CI pipeline set up with GitHub Actions.
Self-Hosted Runner Documentation: The official GitHub documentation for self-hosted runners can be found here: https://docs.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners and https://docs.github.com/en/actions/hosting-your-own-runners/using-self-hosted-runners-in-a-workflow
NVIDIA Container Toolkit: In order to use the GPU within our Docker container, we can integrate the NVIDIA Container Toolkit (https://github.com/NVIDIA/nvidia-docker) into our setup.
Dockerfile Example: A sample Dockerfile that demonstrates basic Rust installation: https://gist.github.com/huitseeker/0c58ee69f63c5e81d6ea64f0dc5153f7. This example can serve as a starting point for our own Dockerfile configuration.

BusIdNotAvailable in gbench

cd neptune-master/gbench
export NEPTUNE_GBENCH_GPUS=99
RUST_LOG=info cargo run -- --max-tree-batch-size 700000 --max-column-batch-size 400000 
    Finished dev [unoptimized + debuginfo] target(s) in 0.36s
     Running `/public/home/cf/neptune-master/target/debug/gbench --max-tree-batch-size 700000 --max-column-batch-size 400000
[2021-03-10T09:41:39Z INFO  gbench] KiB: 4194304
[2021-03-10T09:41:39Z INFO  gbench] leaves: 134217728
[2021-03-10T09:41:39Z INFO  gbench] max column batch size: 400000
[2021-03-10T09:41:39Z INFO  gbench] max tree batch size: 700000
[2021-03-10T09:41:39Z INFO  gbench] GPU[Selector: BatcherType::CustomGPU(BusId(99))] --> Run 0
[2021-03-10T09:41:39Z INFO  gbench] GPU[Selector: BatcherType::CustomGPU(BusId(99))]: Creating ColumnTreeBuilder
[2021-03-10T09:41:39Z INFO  neptune::triton::cl] getting context for ~BusId(99)
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: ClError(BusIdNotAvailable)', gbench/src/main.rs:31:6
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Any', gbench/src/main.rs:163:23

BUSID info

$ rocm-smi --showhw
========================ROCm System Management Interface========================

GPU  DID   GFX RAS   SDMA RAS  UMC RAS  VBIOS             BUS           
1    66a1  DISABLED  ENABLED   ENABLED  113-D1631900-064  0000:04:00.0  
2    66a1  DISABLED  ENABLED   ENABLED  113-D1631900-064  0000:26:00.0  
3    66a1  DISABLED  ENABLED   ENABLED  113-D1631900-064  0000:43:00.0  
4    66a1  DISABLED  ENABLED   ENABLED  113-D1631900-064  0000:63:00.0  
==============================End of ROCm SMI Log ==============================
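The BUS column above is a PCI address of the form domain:bus:device.function, with each part in hexadecimal. The sketch below extracts the bus number as a decimal value; the function name is hypothetical, and whether neptune expects the bus ID in decimal is an assumption, not something this report confirms.

```rust
/// Extracts the bus number from a PCI address such as "0000:26:00.0".
/// The bus field is hexadecimal, so "26" becomes 38.
fn pci_bus_id(address: &str) -> Option<u32> {
    let mut parts = address.split(':');
    let _domain = parts.next()?;
    let bus = parts.next()?;
    u32::from_str_radix(bus, 16).ok()
}
```

For the four devices listed, this gives 4, 38, 67 and 99. The first two match the `B#4` and `B#38` values that `clinfo` reports in its Device Topology lines.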

clinfo

[cf@c07r1n01 gbench]$ clinfo 
Number of platforms:                             1
  Platform Profile:                              FULL_PROFILE
  Platform Version:                              OpenCL 2.1 AMD-APP (2982.0)
  Platform Name:                                 AMD Accelerated Parallel Processing
  Platform Vendor:                               Advanced Micro Devices, Inc.
  Platform Extensions:                           cl_khr_icd cl_amd_event_callback cl_amd_offline_devices 


  Platform Name:                                 AMD Accelerated Parallel Processing
Number of devices:                               4
  Device Type:                                   CL_DEVICE_TYPE_GPU
  Vendor ID:                                     1002h
  Board name:                                    Device 66a1
  Device Topology:                               PCI[ B#4, D#0, F#0 ]
  Max compute units:                             60
  Max work items dimensions:                     3
    Max work items[0]:                           1024
    Max work items[1]:                           1024
    Max work items[2]:                           1024
  Max work group size:                           256
  Preferred vector width char:                   4
  Preferred vector width short:                  2
  Preferred vector width int:                    1
  Preferred vector width long:                   1
  Preferred vector width float:                  1
  Preferred vector width double:                 1
  Native vector width char:                      4
  Native vector width short:                     2
  Native vector width int:                       1
  Native vector width long:                      1
  Native vector width float:                     1
  Native vector width double:                    1
  Max clock frequency:                           1600Mhz
  Address bits:                                  64
  Max memory allocation:                         14588628172
  Image support:                                 Yes
  Max number of images read arguments:           128
  Max number of images write arguments:          8
  Max image 2D width:                            16384
  Max image 2D height:                           16384
  Max image 3D width:                            2048
  Max image 3D height:                           2048
  Max image 3D depth:                            2048
  Max samplers within kernel:                    26273
  Max size of kernel argument:                   1024
  Alignment (bits) of base address:              1024
  Minimum alignment (bytes) for any datatype:    128
  Single precision floating point capability
    Denorms:                                     Yes
    Quiet NaNs:                                  Yes
    Round to nearest even:                       Yes
    Round to zero:                               Yes
    Round to +ve and infinity:                   Yes
    IEEE754-2008 fused multiply-add:             Yes
  Cache type:                                    Read/Write
  Cache line size:                               64
  Cache size:                                    16384
  Global memory size:                            17163091968
  Constant buffer size:                          14588628172
  Max number of constant args:                   8
  Local memory type:                             Scratchpad
  Local memory size:                             65536
  Max pipe arguments:                            16
  Max pipe active reservations:                  16
  Max pipe packet size:                          1703726284
  Max global variable size:                      14588628172
  Max global variable preferred total size:      17163091968
  Max read/write image args:                     64
  Max on device events:                          1024
  Queue on device max size:                      8388608
  Max on device queues:                          1
  Queue on device preferred size:                262144
  SVM capabilities:                              
    Coarse grain buffer:                         Yes
    Fine grain buffer:                           Yes
    Fine grain system:                           No
    Atomics:                                     No
  Preferred platform atomic alignment:           0
  Preferred global atomic alignment:             0
  Preferred local atomic alignment:              0
  Kernel Preferred work group size multiple:     64
  Error correction support:                      0
  Unified memory for Host and Device:            0
  Profiling timer resolution:                    1
  Device endianess:                              Little
  Available:                                     Yes
  Compiler available:                            Yes
  Execution capabilities:                                
    Execute OpenCL kernels:                      Yes
    Execute native function:                     No
  Queue on Host properties:                              
    Out-of-Order:                                No
    Profiling :                                  Yes
  Queue on Device properties:                            
    Out-of-Order:                                Yes
    Profiling :                                  Yes
  Platform ID:                                   0x2b835a7f4d30
  Name:                                          gfx906+sram-ecc
  Vendor:                                        Advanced Micro Devices, Inc.
  Device OpenCL C version:                       OpenCL C 2.0 
  Driver version:                                2982.0 (HSA1.1,LC)
  Profile:                                       FULL_PROFILE
  Version:                                       OpenCL 2.0 
  Extensions:                                    cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program 


  Device Type:                                   CL_DEVICE_TYPE_GPU
  Vendor ID:                                     1002h
  Board name:                                    Device 66a1
  Device Topology:                               PCI[ B#38, D#0, F#0 ]
  Max compute units:                             60
  Max work items dimensions:                     3
    Max work items[0]:                           1024
    Max work items[1]:                           1024
    Max work items[2]:                           1024
  Max work group size:                           256
  Preferred vector width char:                   4
  Preferred vector width short:                  2
  Preferred vector width int:                    1
  Preferred vector width long:                   1
  Preferred vector width float:                  1
  Preferred vector width double:                 1
  Native vector width char:                      4
  Native vector width short:                     2
  Native vector width int:                       1
  Native vector width long:                      1
  Native vector width float:                     1
  Native vector width double:                    1
  Max clock frequency:                           1600Mhz
  Address bits:                                  64
  Max memory allocation:                         14588628172
  Image support:                                 Yes
  Max number of images read arguments:           128
  Max number of images write arguments:          8
  Max image 2D width:                            16384
  Max image 2D height:                           16384
  Max image 3D width:                            2048
  Max image 3D height:                           2048
  Max image 3D depth:                            2048
  Max samplers within kernel:                    26273
  Max size of kernel argument:                   1024
  Alignment (bits) of base address:              1024
  Minimum alignment (bytes) for any datatype:    128
  Single precision floating point capability
    Denorms:                                     Yes
    Quiet NaNs:                                  Yes
    Round to nearest even:                       Yes
    Round to zero:                               Yes
    Round to +ve and infinity:                   Yes
    IEEE754-2008 fused multiply-add:             Yes
  Cache type:                                    Read/Write
  Cache line size:                               64
  Cache size:                                    16384
  Global memory size:                            17163091968
  Constant buffer size:                          14588628172
  Max number of constant args:                   8
  Local memory type:                             Scratchpad
  Local memory size:                             65536
  Max pipe arguments:                            16
  Max pipe active reservations:                  16
  Max pipe packet size:                          1703726284
  Max global variable size:                      14588628172
  Max global variable preferred total size:      17163091968
  Max read/write image args:                     64
  Max on device events:                          1024
  Queue on device max size:                      8388608
  Max on device queues:                          1
  Queue on device preferred size:                262144
  SVM capabilities:                              
    Coarse grain buffer:                         Yes
    Fine grain buffer:                           Yes
    Fine grain system:                           No
    Atomics:                                     No
  Preferred platform atomic alignment:           0
  Preferred global atomic alignment:             0
  Preferred local atomic alignment:              0
  Kernel Preferred work group size multiple:     64
  Error correction support:                      0
  Unified memory for Host and Device:            0
  Profiling timer resolution:                    1
  Device endianess:                              Little
  Available:                                     Yes
  Compiler available:                            Yes
  Execution capabilities:                                
    Execute OpenCL kernels:                      Yes
    Execute native function:                     No
  Queue on Host properties:                              
    Out-of-Order:                                No
    Profiling :                                  Yes
  Queue on Device properties:                            
    Out-of-Order:                                Yes
    Profiling :                                  Yes
  Platform ID:                                   0x2b835a7f4d30
  Name:                                          gfx906+sram-ecc
  Vendor:                                        Advanced Micro Devices, Inc.
  Device OpenCL C version:                       OpenCL C 2.0 
  Driver version:                                2982.0 (HSA1.1,LC)
  Profile:                                       FULL_PROFILE
  Version:                                       OpenCL 2.0 
  Extensions:                                    cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program 


  Device Type:                                   CL_DEVICE_TYPE_GPU
  Vendor ID:                                     1002h
  Board name:                                    Device 66a1
  Device Topology:                               PCI[ B#67, D#0, F#0 ]
  Max compute units:                             60
  Max work items dimensions:                     3
    Max work items[0]:                           1024
    Max work items[1]:                           1024
    Max work items[2]:                           1024
  Max work group size:                           256
  Preferred vector width char:                   4
  Preferred vector width short:                  2
  Preferred vector width int:                    1
  Preferred vector width long:                   1
  Preferred vector width float:                  1
  Preferred vector width double:                 1
  Native vector width char:                      4
  Native vector width short:                     2
  Native vector width int:                       1
  Native vector width long:                      1
  Native vector width float:                     1
  Native vector width double:                    1
  Max clock frequency:                           1600Mhz
  Address bits:                                  64
  Max memory allocation:                         14588628172
  Image support:                                 Yes
  Max number of images read arguments:           128
  Max number of images write arguments:          8
  Max image 2D width:                            16384
  Max image 2D height:                           16384
  Max image 3D width:                            2048
  Max image 3D height:                           2048
  Max image 3D depth:                            2048
  Max samplers within kernel:                    26273
  Max size of kernel argument:                   1024
  Alignment (bits) of base address:              1024
  Minimum alignment (bytes) for any datatype:    128
  Single precision floating point capability
    Denorms:                                     Yes
    Quiet NaNs:                                  Yes
    Round to nearest even:                       Yes
    Round to zero:                               Yes
    Round to +ve and infinity:                   Yes
    IEEE754-2008 fused multiply-add:             Yes
  Cache type:                                    Read/Write
  Cache line size:                               64
  Cache size:                                    16384
  Global memory size:                            17163091968
  Constant buffer size:                          14588628172
  Max number of constant args:                   8
  Local memory type:                             Scratchpad
  Local memory size:                             65536
  Max pipe arguments:                            16
  Max pipe active reservations:                  16
  Max pipe packet size:                          1703726284
  Max global variable size:                      14588628172
  Max global variable preferred total size:      17163091968
  Max read/write image args:                     64
  Max on device events:                          1024
  Queue on device max size:                      8388608
  Max on device queues:                          1
  Queue on device preferred size:                262144
  SVM capabilities:                              
    Coarse grain buffer:                         Yes
    Fine grain buffer:                           Yes
    Fine grain system:                           No
    Atomics:                                     No
  Preferred platform atomic alignment:           0
  Preferred global atomic alignment:             0
  Preferred local atomic alignment:              0
  Kernel Preferred work group size multiple:     64
  Error correction support:                      0
  Unified memory for Host and Device:            0
  Profiling timer resolution:                    1
  Device endianess:                              Little
  Available:                                     Yes
  Compiler available:                            Yes
  Execution capabilities:                                
    Execute OpenCL kernels:                      Yes
    Execute native function:                     No
  Queue on Host properties:                              
    Out-of-Order:                                No
    Profiling :                                  Yes
  Queue on Device properties:                            
    Out-of-Order:                                Yes
    Profiling :                                  Yes
  Platform ID:                                   0x2b835a7f4d30
  Name:                                          gfx906+sram-ecc
  Vendor:                                        Advanced Micro Devices, Inc.
  Device OpenCL C version:                       OpenCL C 2.0 
  Driver version:                                2982.0 (HSA1.1,LC)
  Profile:                                       FULL_PROFILE
  Version:                                       OpenCL 2.0 
  Extensions:                                    cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program 


  Device Type:                                   CL_DEVICE_TYPE_GPU
  Vendor ID:                                     1002h
  Board name:                                    Device 66a1
  Device Topology:                               PCI[ B#99, D#0, F#0 ]
  Max compute units:                             60
  Max work items dimensions:                     3
    Max work items[0]:                           1024
    Max work items[1]:                           1024
    Max work items[2]:                           1024
  Max work group size:                           256
  Preferred vector width char:                   4
  Preferred vector width short:                  2
  Preferred vector width int:                    1
  Preferred vector width long:                   1
  Preferred vector width float:                  1
  Preferred vector width double:                 1
  Native vector width char:                      4
  Native vector width short:                     2
  Native vector width int:                       1
  Native vector width long:                      1
  Native vector width float:                     1
  Native vector width double:                    1
  Max clock frequency:                           1600Mhz
  Address bits:                                  64
  Max memory allocation:                         14588628172
  Image support:                                 Yes
  Max number of images read arguments:           128
  Max number of images write arguments:          8
  Max image 2D width:                            16384
  Max image 2D height:                           16384
  Max image 3D width:                            2048
  Max image 3D height:                           2048
  Max image 3D depth:                            2048
  Max samplers within kernel:                    26273
  Max size of kernel argument:                   1024
  Alignment (bits) of base address:              1024
  Minimum alignment (bytes) for any datatype:    128
  Single precision floating point capability
    Denorms:                                     Yes
    Quiet NaNs:                                  Yes
    Round to nearest even:                       Yes
    Round to zero:                               Yes
    Round to +ve and infinity:                   Yes
    IEEE754-2008 fused multiply-add:             Yes
  Cache type:                                    Read/Write
  Cache line size:                               64
  Cache size:                                    16384
  Global memory size:                            17163091968
  Constant buffer size:                          14588628172
  Max number of constant args:                   8
  Local memory type:                             Scratchpad
  Local memory size:                             65536
  Max pipe arguments:                            16
  Max pipe active reservations:                  16
  Max pipe packet size:                          1703726284
  Max global variable size:                      14588628172
  Max global variable preferred total size:      17163091968
  Max read/write image args:                     64
  Max on device events:                          1024
  Queue on device max size:                      8388608
  Max on device queues:                          1
  Queue on device preferred size:                262144
  SVM capabilities:                              
    Coarse grain buffer:                         Yes
    Fine grain buffer:                           Yes
    Fine grain system:                           No
    Atomics:                                     No
  Preferred platform atomic alignment:           0
  Preferred global atomic alignment:             0
  Preferred local atomic alignment:              0
  Kernel Preferred work group size multiple:     64
  Error correction support:                      0
  Unified memory for Host and Device:            0
  Profiling timer resolution:                    1
  Device endianess:                              Little
  Available:                                     Yes
  Compiler available:                            Yes
  Execution capabilities:                                
    Execute OpenCL kernels:                      Yes
    Execute native function:                     No
  Queue on Host properties:                              
    Out-of-Order:                                No
    Profiling :                                  Yes
  Queue on Device properties:                            
    Out-of-Order:                                Yes
    Profiling :                                  Yes
  Platform ID:                                   0x2b835a7f4d30
  Name:                                          gfx906+sram-ecc
  Vendor:                                        Advanced Micro Devices, Inc.
  Device OpenCL C version:                       OpenCL C 2.0 
  Driver version:                                2982.0 (HSA1.1,LC)
  Profile:                                       FULL_PROFILE
  Version:                                       OpenCL 2.0 
  Extensions:                                    cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program

Making pasta_curves work

This is a tracking issue for making Neptune work with pasta_curves.

dev needs to be rebased on main

dev at b3dc59e does not include main at a3e9a96
