Giter Site home page Giter Site logo

zp7's Introduction

ZP7 (Zach's Peppy Parallel-Prefix-Popcountin' PEXT/PDEP Polyfill)

This is a fast branchless replacement for the PEXT/PDEP instructions. If you don't know what those are, this probably won't make much sense.

These instructions are very fast on Intel chips that support them (3 cycles latency), but have a much slower implementation on AMD. This code will be much slower than the native instructions on Intel chips, but faster on AMD chips, and generally faster than a naive loop (for all but the most trivial cases).

A detailed description of the algorithm used is in zp7.c.

Usage

This is distributed as a single C file, zp7.c. These two functions are drop-in replacements for _pext_u64 and _pdep_u64:

uint64_t zp7_pext_64(uint64_t a, uint64_t mask);
uint64_t zp7_pdep_64(uint64_t a, uint64_t mask);

There are also variants for precomputed masks, in case the same mask is used across multiple calls (whether for PEXT or PDEP--the masks are the same for both). In this case, a zp7_masks_64_t struct is created from the input mask using the zp7_ppp_64 function, and passed to the zp7_*_pre_64 variants:

zp7_masks_64_t zp7_ppp_64(uint64_t mask);
uint64_t zp7_pext_pre_64(uint64_t a, const zp7_masks_64_t *masks);
uint64_t zp7_pdep_pre_64(uint64_t a, const zp7_masks_64_t *masks);

Three #defines can change the instructions used, depending on the target CPU, as listed below. If none of these symbols are defined, the code should portable to any architecture.

  • HAS_CLMUL: whether the processor has the CLMUL instruction set, which is on most x86 CPUs since ~2010. Using CLMUL gives a fairly significant speedup and code size reduction.
  • HAS_BZHI: whether the processor has BZHI, which was introduced in the same BMI2 instructions as PEXT/PDEP. This is only used once for PDEP, so it matters much less than CLMUL.
  • HAS_POPCNT: whether the processor has POPCNT, which was introduced in SSE4a/SSE4.2. Like BZHI, this is only used once for PDEP, but matters more for speed, as the software POPCNT is several instructions.

This code is hardcoded to operate on 64 bits. It could easily be adapted for 32 bits by changing N_BITS to 5, replacing uint64_t with uint32_t, and modifying the popcount/bzhi intrinsics/polyfills. This will be slightly faster and will save some memory for pre-calculated masks, but is left out for simplicity. Smaller inputs could likewise be supported by similar modifications.

There are also a couple optimizations that could be made for precomputed masks for PDEP: the POPCNT/BZHI combination, as well as six shifts, depend only on the mask, and could be precomputed. I've left this out for now in the interest of simplicity and allowing precomputed masks to be shared between PEXT and PDEP.

zp7's People

Contributors

zwegner avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

zp7's Issues

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.