forceflow / libmorton Goto Github PK

C++ header-only library with methods to efficiently encode/decode Morton codes in/from 2D/3D coordinates

License: MIT License

C++ 95.48% Makefile 0.40% CMake 4.11%

libmorton's Introduction

I'm Jeroen Baert. I write software, mostly for graphics-related applications. I am the developer of libmorton. I'm part of the Nerdland podcast team. I'm most proficient in C/C++, CUDA, Python and Java.

libmorton's People

Contributors

Stargazers

Watchers

Forkers

aavenel tstellanova pyprogrammer gnzlbg hellocomrade daviddoria skaslev sriravic lucabrivio taehongjeong trianglespct nagashayan karinanazare luozhipi victorique-gosick lilleyse wehrdo eglrp kevinhartman happyerik rolaxxxx fei2020 conradhaupt juampamuc peterzs snowfox1939 rkowalewski chris0x44 shivajimedida vastyellownew cloudwow gr4yj3d1 juampatronics russellsch ezhangle bssrdf tzian d3v3l0 zeta1999 jerrycxj mablanchard mustafautku ulises1229 r-luna75 jorgen rebecca128 stomatogod jimbok8 itcastpygit marcos30004347 crystiol grideyes-2010 heroism502 blendergamer chi-w-ng lourenko phyboyzhang uniqlow hengle hl0071 compilerluke qushuaibo rnshah9 tonyskjellum xuyanwen2012 jamestiotio sitrakaresearchandpoc yangzhengshi

libmorton's Issues

morton.h missing

From the README:

You can use the library by including libmorton/morton.h. This will always include the most efficient way to date. The header morton_alternatives.h contains older / alternative implementations, for reference only.

There is no libmorton/morton.h

Morton Offsets

It is possible to quickly (that is, without going through a decode-modify-encode pipeline) compute neighbours of a Morton-encoded coordinate. The technique is mentioned on wikipedia.

That could be useful to add, though I realise the description of this library only mentions encoding and decoding.

[Q] Can a retrieve parent code given child Morton code

How?

Encode and decode mismatch

Hi.

This library is nice, however I've found an issue.

Consider:

template<typename T,typename U>
ostream & operator << (ostream & os, const pair<T,U> &p)
{
    os << "(" << p.first << "," << p.second << ")";
    return os;
}

int main()
{
    uint_fast32_t i = 16, j = 16, x, y;
    morton2D_64_decode(morton2D_64_encode(i, j), x, y);
    cout << make_pair(i,j) << endl << make_pair(x,y)<<endl;
    return 0;
}

Compiling with G++ for C++11, I've got the following:

g++ --std=c++11 morton.cpp

(16,16)
(4,4)

Which is a mismatch between encode and decode.

Thanks.

Morton Order

Is there a way to compare which of two floats/doubles lies further along the Morton curse?
Is there a direct way to do this besides rescaling, converting to integer and doing the conversion?
Something like Moore's Hilbert code function hilbert_ieee_cmp() (here and here).

only cmake versions >=3.15 are supported

Hi everyone,

I noticed with commit b7c5874d68413fe7953befc97597cc6c08c824ab the stated "cmake_minimum_required(VERSION 3.8.2)" in CMakeLists.txt isn't correct anymore.

For cmake versions older than 3.15, you will get the following error:

libmorton-0.2.6/CMakeLists.txt:18 (set_property):
INTERFACE_LIBRARY targets may only have whitelisted properties. The property "PUBLIC_HEADER" is not allowed.

If it's fine to bump up the cmake minimum required version to 3.15, this would be the "easiest" fix.
Another possible workaround for the issue is described in this comment:
https://gitlab.kitware.com/cmake/cmake/-/issues/15234#note_338817

morton codes out of bounds for N = 48

hello,

in the below function I use your library to re-map an input vector of integers in row-major order (corresponding to an NxN array) to a vector in z-order:

std::vector<int32_t> remapToMortonOrder(const std::vector<int32_t>& input, int N) {
    if (input.size() != N * N) {
        throw std::invalid_argument("Morton remap input size does not match the specified dimensions.");
    }

    std::vector<int32_t> output(N * N, 0);
    uint_fast32_t maxIndex = N * N - 1;

    for (int y = 0; y < N; ++y) {
        for (int x = 0; x < N; ++x) {
            int index = y * N + x;
            uint_fast32_t mortonCode = libmorton::morton2D_32_encode(static_cast<uint_fast16_t>(x), static_cast<uint_fast16_t>(y));
            if (mortonCode > maxIndex) {
                std::cerr << "Morton remap error: Morton code out of bounds - " << mortonCode << std::endl;
                throw std::out_of_range("Morton remap code exceeds the bounds of the output array.");
            }
            output[mortonCode] = input[index];
        }
    }
    return output;
}

somehow, this code works for N=2,4,8,16,32,64,128,256, but it does not work for N=48. with N=48, the generated mortonCode exceeds the max index of the output array.

am I using libmorton::morton2D_32_encode wrong?

thanks

A better naming system using tag dispatch

Hi,

You mentioned in TODO that a better naming system is required. I solved the issue by using a combination of tag dispatch, function overloading, and partial specialization of template classes. Please see the implementation in shohirose/morton for details.

A brief overview of my idea is the following:

namespace tag {

// Tags representing each implementation
struct bmi {};
struct preshifted_lookup_table {};
struct lookup_table {};

} // namespace tag

namespace detail {

template <typename MortonCode, typename Coordinate, typename Tag>
class morton2d {};

// Partial specialization of morton2d for tag::bmi
template <typename MortonCode, typename Coordinate>
class morton2d<MortonCode, Coordinate, tag::bmi> {
 public:
  static MortonCode encode(const Coordinate x, const Coordinate y) noexcept {
    // ...
  }
  static void decode(const MortonCode m, Coordinate& x, Coordinate& y) noexcept {
    // ...
  }
};

// Partial specialization of morton2d for tag::preshifted_lookup_table
template <typename MortonCode, typename Coordinate>
class morton2d<MortonCode, Coordinate, tag::preshifted_lookup_table> {
  // ...
};

// Partial specialization of morton2d for tag::lookup_table
template <typename MortonCode, typename Coordinate>
class morton2d<MortonCode, Coordinate, tag::lookup_table> {
  // ...
};

} // namespace detail

// Switch implementation by using a tag dispatch technique.
template <typename Tag>
inline uint_fast32_t encode(const uint_fast16_t x, const uint_fast16_t y, Tag) noexcept {
  return detail::morton2d<uint_fast32_t, uint_fast16_t, Tag>::encode(x, y);
}

template <typename Tag>
inline void decode(const uint_fast32_t m, uint_fast16_t& x, uint_fast16_t& y, Tag) noexcept {
  detail::morton2d<uint_fast32_t, uint_fast16_t, Tag>::decode(x, y);
}

Then, we can specify an implementation through a uniform API:

uint_fast16_t x = // ...
uint_fast16_t y = // ...

// Encode using BMI instruction set (if available)
const auto m1 = encode(x, y, tag::bmi{});
// Encode using preshifted lookup table
const auto m2 = encode(x, y, tag::preshifted_lookup_table{});
// Encode using lookup table
const auto m3 = encode(x, y, tag::lookup_table{});

I hope this might help you.

Early Termination Linux 64-bit version fail on edge cases

Somehow an extra bit gets thrown in here:

x: 0000000000000000000000000000000000000000000111110010001111110001 (2040817)
y: 0000000000000000000000000000000000000000000101001010000110000100 (1352068)
z: 0000000000000000000000000000000000000000000111111000011001111001 (2066041)
morton: 0111101111101101110000011000000100101011011101101101100010000101(8930006396669712517)
x_result: 0000000000000000000000000000000000000000000111110010001111110001 (2040817)
y_result: 0000000000000000000000000000000000000000001101001010000110000100 (3449220)
z_result: 0000000000000000000000000000000000000000100111111000011001111001 (10454649)
64-bit using methods encode LUT Shifted ET and decode LUT ET

Tests for correctness, help request

Dear @Forceflow thank you very much for sharing your work, it is very appreciated.

For my own applications I am developing a library similar to libmorton, but in pure Fortran. I am using a bit interleaving technique (see Stocco 2009) that I guess is similar to your (I have not yet studied your method).

Anyhow, I am looking to your work for validating my implementation (at least for now, when I'll have validated my implementation, I'll try to improve the algorithm performances to match yours, maybe adopting your method) because I am almost sure that my implementation is bugged. Unfortunately, I really have not knowledge of C++ and a very very limited one of C. I tried to read the contents of your test subdirectory, but I cannot understand how you test for correctness of results. Have you hard-coded a list of expected-results into a tests-suite? In particular

Can you give some hints (online-calculator, list of already-validated results, other-validated codes) on how validate my implementation?

I can write my self a list of expected-validated results by means of the Morton definition, but if there are external resources ready to use, I really prefer to use them (for my laziness, but also for safety reasons).

My best regards.

Add encode/decode matching tests

Add tests to see if encoding and decoding result in same values. Should be easy enough!

Assessment of the difficulty in porting CPU architecture for libmorton

Hello everyone! I am working on implementing a tool to assess the complexity of CPU architecture porting. It primarily focuses on RISC-V architecture porting. In fact, the tool may have an average estimate of various architecture porting efforts.My focus is on the overall workload and difficulty of transplantation in the past and future,even if a project has already been ported.As part of my dataset, I have collected the libmorton project. I would like to gather community opinions to support my assessment. I appreciate your help and response! Based on scanning tools, the porting complexity is determined to be simple, with a small amount of code related to the CPU architecture in the project. Is this assessment accurate?Do you have any opinions on personnel allocation and consumption time？ I look forward to your help and response.

Compilation failure in VirtualBox when AVX2 is enabled

In multiple cases now I've seen my users have compilation issues when AVX2 is enabled in a VM but not BMI2.

I understand that you used a check for __AVX2__ as a substitute for __BMI2__ on MSVC, but this breaks on *nix if AVX2 happens to be enabled without BMI2 (e.g. weird CPU flags or VM).

The checks for AVX2 should likely be coupled with a check for _MSC_VER.

pmmp/PHP-Binaries#109

Creating Tables for N-Dimenstion

Hi,

Sorry for opening such and issue. However, I was interested if its possible to generate Morton code for n-dimesnsions using the precomputed tables. That is, generating such tables for LUT. Cheers

PS: I can't seems to find the coding and decoding version of naive for-loop implementation (as discussed in your blogs).

Computation of checkbits could be cleaned up

The lines that compute checkbits (the curve order or bits per dimension) currently use floating-point math, including floor(), which is not a trivial function, depending perhaps on the floating point handling, i.e. /fp:strict.

Here's an example:

libmorton/include/libmorton/morton3D.h

Line 129 in 6d22725

    
           unsigned int checkbits = static_cast<unsigned int>(floor((sizeof(morton) * 8.0f / 3.0f)));

This code can be replaced with the following:
unsigned int checkbits = (sizeof(morton) * 8) / 3;

This uses integer division, includes no function call, and gives the same results in all cases.

The key to this working is the order of operations - doing the multiply first gives an intermediate result of 64 or 32. Dividing this by three yields 21 or 10, as desired.

Getting a Morton ordering of a 3D grid

I have a (arbitrary size) 3D grid that I want to traverse in a Morton order instead of a raster scan order. Can I use these functions to do that? Or does my grid have to be exactly 32x32x32 or something like that?

Problem with m3D_d_sLUT

I attempt to encode the maximum bit of each coordinate, so x = y = z = 2^21 - 1, after that I run decode but I only get the right values of x and y, the value of z always wrong (I also tested with lower value like 2^20, 2^19). So I think maybe we have the problem with that function or did I use it with the wrong way? Anw, could you explain for me more about how decoding work?

Another problem I want to ask is if I want to use morton code with 128 bits (so each x, y and z will have 42 bits), do I have to fixed anything? Because as you said before, your method is divide and conquer, so I think the number of bits is not a problem right?

Thanks for your help!

Incorrect implementation of 2D_magicbits?

From the mailbox, gotta verify this.

Hi,

i looked at your libmorton software and found a bug in the function morton2D_GetSecondBits

where it goes:
morton x = m & masks[4];
x = (x ^ (x >> 2)) & masks[3];
x = (x ^ (x >> 4)) & masks[2];
x = (x ^ (x >> 8)) & masks[1];

but it should be:
morton x = m & masks[5];
x = (x ^ (x >> 1)) & masks[4];
x = (x ^ (x >> 2)) & masks[3];
x = (x ^ (x >> 4)) & masks[2];
x = (x ^ (x >> 8)) & masks[1];

At least that works for me in my c version of your code.

Please tell me, if i am correct about that.

Regards and thanks,

3D Morton for-loop encoding is broken

In morton3D.h, lines 120 and 141, the shift variable should be initialized to 3_i, not 2_i, otherwise some bits are overwriting bits from the previous iterations.

I made the same comment on the original blogpost: http://www.forceflow.be/2013/10/07/morton-encodingdecoding-through-bit-interleaving-implementations/#comment-7508

Thanks for the lib !
Eric

include folder

It would be great if you could put the libmorton folder inside a include folder in the repository. In my project I download and unpack releases (tags) from github. I then set the includepath so that I can do
#include <libmorton/morton.h>
in my include files. However, to achieve this now I have to set the root project folder in the include path, making it possible to include all .gitattributes etc etc. It would therefor be nice if there was another directory that I could put in my includepath.

Supply same uint types overloads or templates

Somewhat related to #41

Having to manage both uint_fast32_t / uint_fast16_t or uint_fast64_t / uint_fast32_t is not easy, especially with templated code:

The following overload:
inline void morton2D_32_decode(const uint_fast32_t morton, uint_fast16_t& x, uint_fast16_t& y);

Should also offer the following overload by casting:
inline void morton2D_32_decode(const uint_fast32_t morton, uint_fast32_t & x, uint_fast32_t & y);

This brings up the fact that it could use a templated implementation to allow that kind of overloads:

template<typename T, typename U>
inline void morton2D_32_decode(const T morton, U& x, U& y);

Implementation should static_assert that both T and U are unsigned and optionnaly that there are enough std::numeric_limits<>::digits between each other.

Compilation failing on Mac

My configuration:
MacOS Mojave 10.14.6Configured with: --prefix=/Library/Developer/CommandLineTools/usr --with-gxx-include-dir=/Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk/usr/include/c++/4.2.1
Apple LLVM version 10.0.1 (clang-1001.0.46.4)
Target: x86_64-apple-darwin18.7.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

Errors I am seeing:
template<typename...T>
^
/Users/kiran/libmorton/libmorton/test/libmorton_test.h:98:107: error: expected expression
return control_encode_impl<sizeof...(fields)>(std::numeric_limits<uint64_t>::digits / sizeof...(fields), { fields... });
^
/Users/kiran/libmorton/libmorton/test/libmorton_test.h:102:118: error: a space is required between consecutive right angle brackets
(use '> >')
void control_decode_impl(uint64_t encoding, std::size_t bitCount, const std::valarray<std::reference_wrapper<uint64_t>>& fields) {

Stubs for 2D decoding still refer to for-loop method

Even though you implemented faster ones.

Interpreting output of morton3D_64_encode

Here is a demo that is just a loop over a 3x3 cube. I guess I expected the values 0-26 to come back in some shuffled version, but instead I got:

0 1 8 2 3 10 16 17 24 4 5 12 6 7 14 20 21 28 32 33 40 34 35 42 48 49 56


#include <iostream>

#include "include/morton.h"

int main()
{
    for(unsigned int z = 0; z < 3; ++z)
    {
        for(unsigned int y = 0; y < 3; ++y)
        {
            for(unsigned int x = 0; x < 3; ++x)
            {
                int index = morton3D_64_encode(x,y,z);
                std::cout << "index: " << index << std::endl;
            }
        }
    }

  return 0;
}

Can you explain how to interpret these indices?

Thanks!

David

Use BMI2 pdep/pext instructions

BMI2 provides parallel bit deposit/extract instructions that allow an efficient encoding/decoding of morton codes. Something like the following should do the trick. Basically in 3D one needs 3 calls to pdep/pext intrinsics to encode/decode morton codes.

#include <array>
#include <type_traits>
#if defined(__BMI2__)
#include <immintrin.h>
#endif

// wrappers around the bmi2 pdep/pext intrinsics
namespace bmi2_detail {

constexpr uint32_t pdep(uint32_t source, uint32_t mask) noexcept {
  return _pdep_u32(source, mask);
}
constexpr uint64_t pdep(uint64_t source, uint64_t mask) noexcept {
  return _pdep_u64(source, mask);
}

constexpr uint32_t pext(uint32_t source, uint32_t mask) noexcept {
  return _pext_u32(source, mask);
}
constexpr uint64_t pext(uint64_t source, uint64_t mask) noexcept {
  return _pext_u64(source, mask);
}

}  // namespace bmi2_detail

/// Parallel Bits Deposit implementation w/o BMI2 intrinsics
template <typename Integral>
__attribute__((no_sanitize("integer"))) 
constexpr Integral deposit_bits(Integral x, Integral mask) {
#if !defined(__BMI2__)
  Integral res = 0;
  for (Integral bb = 1; mask != 0; bb += bb) {
    if (x & bb) { res |= mask & (-mask); }
    mask &= (mask - 1);
  }
  return res;
#else
  return bmi2_detail::pdep(x, mask);
#endif
}

/// Parallel Bits Extract implementation w/o BMI2 intrinsics
template <typename Integral>
__attribute__((no_sanitize("integer"))) 
constexpr Integral extract_bits(Integral x, Integral mask) {
#if !defined(__BMI2__)
  Integral res = 0;
  for (Integral bb = 1; mask != 0; bb += bb) {
    if (x & mask & -mask) { res |= bb; }
    mask &= (mask - 1);
  }
  return res;
#else
  return bmi2_detail::pext(x, mask);
#endif
}

// restrict to unsigned integer types
template <typename T>
using enable_if_unsigned_t = std::enable_if_t<std::is_unsigned<T>{}>;

template <typename UInt, typename = enable_if_unsigned_t<UInt>>
constexpr UInt encode(std::array<UInt, 1> xs) noexcept {
  return xs[0];
}
template <typename UInt, typename = enable_if_unsigned_t<UInt>>
constexpr std::array<UInt, 1> decode(UInt code, std::array<UInt, 1>) noexcept {
  // I use std::array for decoding but using anything else should be trivial
  return {{code}};
}

template <typename UInt, typename = enable_if_unsigned_t<UInt>> 
UInt encode(std::array<UInt, 2> xs) noexcept {
  return deposit_bits(xs[1], static_cast<UInt>(0xAAAAAAAAAAAAAAAA))
         | deposit_bits(xs[0], static_cast<UInt>(0x5555555555555555));
}
template <typename UInt,  typename = enable_if_unsigned_t<UInt>>
std::array<UInt, 2> decode(UInt code, std::array<UInt, 2>) noexcept {
  return {{extract_bits(code, static_cast<UInt>(0x555555555555555)),
           extract_bits(code, static_cast<UInt>(0xAAAAAAAAAAAAAAAA))}};
}

template <typename UInt, typename = enable_if_unsigned_t<UInt>>
UInt encode(std::array<UInt, 3> xs) noexcept {
  return deposit_bits(xs[2], static_cast<UInt>(0x4924924924924924))
         | deposit_bits(xs[1], static_cast<UInt>(0x2492492492492492))
         | deposit_bits(xs[0], static_cast<UInt>(0x9249249249249249));
}
template <typename UInt, enable_if_unsigned_t<UInt>>
std::array<UInt, 3> decode(UInt code, std::array<UInt, 3>) noexcept {
  return {{extract_bits(code, static_cast<UInt>(0x9249249249249249)),
           extract_bits(code, static_cast<UInt>(0x2492492492492492)),
           extract_bits(code, static_cast<UInt>(0x4924924924924924))}};
}

findFirstSetBitZeroIdx() does not work correctly for small morton code

The function findFirstSetBitZeroIdx() does not work correctly for small (<= 4byte) morton values when compiling on 64-bit Linux using gcc or clang.

const uint32_t x = 545658634;
unsigned long c;
findFirstSetBitZeroIdx(x, c); // Expected: c == 29

The issue is that internally, the builtin function __builtin_clzll is used for all inputs, i.e. the input variable x is cast to unsigned long long. For the ID above, __builtin_clzll returns 34, which is correct if x were a 64-bit integer.
But that result is substracted from sizeof(morton) * 8, which in this case is 32.
In the end we have 32 - 34 -1, cast to unsigned long, which is an integer overflow.

Possible fix:

*firstbit_location = static_cast<unsigned long>((sizeof(unsigned long long) * 8) - __builtin_clzll(x) - 1);

GCC: Wrong builtin used

Use clz instead of ffs, this is just plain wrong.

AVX2 acceleration is only compiling with MSVC build system

When just using the main header file morton.h the preprocessor directives disallow compilation of AVX2 accelerated functions when using any compiler other than MSVC. See these lines:

#if defined(__BMI2__) || (defined(__AVX2__) && defined(_MSC_VER))
#include "morton_BMI.h"
#elif defined(__AVX512BITALG__)
#include "morton_AVX512BITALG.h"
#endif

I don't see any apparent reason for this but the effect is that building is prevented on most compiler systems.
I would appreciate removal of this requirement if it isn't there on purpose as my current solution is a duplicated morton.h without the directives which is neither beautiful nor a good solution.

Thanks for your support!

Cuda version

Dear Jeroen,

I modified this library to work with Cuda for my thesis (also made it a single file). I am attaching it in case you are interested. Keep in mind it's not well tested yet (I'm still in the process of designing my linear octree).

Best Regards,
Ali

morton.zip

Current libmorton test timing is off

Need to unroll loops, as demonstrated in the comment on http://www.forceflow.be/2013/10/07/morton-encodingdecoding-through-bit-interleaving-implementations/#comment-22446

Problem with round trip between m3D_e_for(...) and m3D_d_for(..)

While studying your implementation using the simple for loop in the file morton3D.h I noticed that on line 129 of m3D_e_for the loop is implemented as
for (unsigned int i = 0; i < checkbits; ++i) {

while the corresponding m#D_e_for on line 253 is implemented as
for (unsigned int i = 0; i <= checkbits; ++i) {

while I haven't run your code I can confirm that when doing a port of the functions into a C# project I needed to change both loops to use i < checkbits before I could get my round trip tests to pass.

Install support in CMake scripts?

The libmorton CMake scripts currently to not have installation directives. Would be nice to allow installing the headers to a system directory (possibly respecting the GNUInstallDirs CMAKE_INSTALL_INCLUDEDIR standard variable?). This would simplify integration of the library in complex software stacks where build is automated.

A _mm512_bitshuffle_epi64_mask method (vpshufbitqmb, AVX512_BITALG)

In anticipation of Icelake bringing AVX512 to consumer hardware, I've come up with an alternative to the pdep method for encoding a 32-bit 2D coordinate into a 64-bit morton code which utilizes the vpshufbitqmb instruction of the BITALG subset of AVX512.
vpshufbitqmb allows building a new value bit by bit by indexing into a 64-bit lane(into an AVX512 k-register mask).

It relies on interpreting two 32-bit X and Y coordinates as a continuous 64-bit value though. It can support much higher dimensions provided you zip(as in, feed in and interleave the leading bytes or so of the coordinates vector elements into each 64-bit lane before selecting bits) certain bytes of the input coordinates.

There is no throughput or latency data at the moment for this instruction or any hardware to actually test this upon but the assembly looks very promising. Especially for any bulk-processing where all the constants are loaded outside of the loop.

An illustrative example code of the two methods:

#pragma pack(push, 1)
union Coord2D
{
	std::uint32_t u32[2]; // x, y
	std::uint64_t u64;
};
#pragma pack(pop)

std::uint64_t Morton2DEncode( const Coord2D& coord )
{
	const __m512i CoordVec = _mm512_set1_epi64(coord.u64);
	// Interleave bits from 32-bit X and Y coordinate
	const __mmask64 Interleave = _mm512_bitshuffle_epi64_mask(
		CoordVec,+
		_mm512_set_epi16(
			0x20'00 + 0x01'01 * 31, 0x20'00 + 0x01'01 * 30,
			0x20'00 + 0x01'01 * 29, 0x20'00 + 0x01'01 * 28,
			0x20'00 + 0x01'01 * 27, 0x20'00 + 0x01'01 * 26,
			0x20'00 + 0x01'01 * 25, 0x20'00 + 0x01'01 * 24,
			0x20'00 + 0x01'01 * 23, 0x20'00 + 0x01'01 * 22,
			0x20'00 + 0x01'01 * 21, 0x20'00 + 0x01'01 * 20,
			0x20'00 + 0x01'01 * 19, 0x20'00 + 0x01'01 * 18,
			0x20'00 + 0x01'01 * 17, 0x20'00 + 0x01'01 * 16,
			0x20'00 + 0x01'01 * 15, 0x20'00 + 0x01'01 * 14,
			0x20'00 + 0x01'01 * 13, 0x20'00 + 0x01'01 * 12,
			0x20'00 + 0x01'01 * 11, 0x20'00 + 0x01'01 * 10,
			0x20'00 + 0x01'01 *  9, 0x20'00 + 0x01'01 *  8,
			0x20'00 + 0x01'01 *  7, 0x20'00 + 0x01'01 *  6,
			0x20'00 + 0x01'01 *  5, 0x20'00 + 0x01'01 *  4,
			0x20'00 + 0x01'01 *  3, 0x20'00 + 0x01'01 *  2,
			0x20'00 + 0x01'01 *  1, 0x20'00 + 0x01'01 *  0
		)
	);
	return _cvtmask64_u64(Interleave);
}

std::uint64_t m2D_e_BMI(const Coord2D& coord)
{
	return _pdep_u64(static_cast<std::uint64_t>(coord.u32[0]), 0x5555555555555555)
		| _pdep_u64(static_cast<std::uint64_t>(coord.u32[1]), 0xAAAAAAAAAAAAAAAA);
}

The generated assembly using gcc 9.1 with -Ofast -march=icelake-client:

Morton2DEncode(Coord2D const&):
        vpbroadcastq    zmm0, QWORD PTR [rdi]
        vpshufbitqmb    k0, zmm0, ZMMWORD PTR .LC0[rip]
        kmovq   rax, k0
        vzeroupper
        ret
m2D_e_BMI(Coord2D const&):
        mov     eax, DWORD PTR [rdi]
        movabs  rdx, 6148914691236517205
        pdep    rax, rax, rdx
        mov     edx, DWORD PTR [rdi+4]
        movabs  rcx, -6148914691236517206
        pdep    rdx, rdx, rcx
        or      rax, rdx
        ret

I have verified the implementation with Intel's emulator and I'd love to update with some actual benchmarks once I get some Icelake hardware in my hands.

What do you think?

Comparing with mmcloughlin optimizations?

Hi, is there a benchmark comparing libmorton with this Golang optimizations?
https://mmcloughlin.com/posts/geohash-assembly (soruce).

Early termination LUT versions are incorrect on 32-bit compile

Early termination decoding stops too early and misses 8 bits.
Only happens on 32-bit compilation.

x: 00000000000000000000000000101001 (41)
y: 00000000000000000100100000100011 (18467)
z: 00000000000000000001100010111110 (6334)
morton: 0000000000000000000010000100110000000000100000111100101100110011(9122519173939)
x_result: 00000000000000000000000000101001 (41)
y_result: 00000000000000000000000000100011 (35)
z_result: 00000000000000000000000000111110 (62)
64-bit using methods encode LUT Shifted ET and decode LUT Shifted ET

Add name and year to license copyright

Thanks for the great library!

Just a pedantic little nitpick: The license file currently has the default "Copyright (c) [year] [fullname]"

Is the magicbit version correct?

I have a morton code library in rust that just uses BMI2 and wanted to use magicbits when BMI2 was not available. For testing, I used an invariant of the form:

fn invariant<T: Int>(v: T) {
  { // 2D
    let (x, y) = morton_decode_2d(v);
    let vs = morton_encode_2d(x, y);
    assert_eq!(v, vs);
  }
  { // 3D
    let (x, y, z) = morton_decode_3d(v);
    let vs = morton_encode_3d(x, y, z);
    assert_eq!(v, vs);
  }
}

for many integer ranges. I am testing for all u8, u16, [u16::max as u32, u16::max as u32 - 1000000], [u32::max - 1000000, u32::max], [u32::max as u64, u32::max as u64 - 1000000], [u64::max - 1000000, u64::max].

The BMI2 version always passes all tests correctly so it must be at least be a bug with my implementation.

I needed to change your magicbits implementation for the 2D case to:

// 2D decode 32-bit:
            x = x & MORTON_MASK2D_U32[5];
            x = (x ^ (x >> 1)) & MORTON_MASK2D_U32[4];
            x = (x ^ (x >> 2)) & MORTON_MASK2D_U32[3];
            x = (x ^ (x >> 4)) & MORTON_MASK2D_U32[2];
            x = (x ^ (x >> 8)) & MORTON_MASK2D_U32[1];
            x = (x ^ (x >> 16)) & MORTON_MASK2D_U32[0];
// 2D encode 32-bit:
            x = (x | x << 16) & MORTON_MASK2D_U32[1];
            x = (x | x << 8) & MORTON_MASK2D_U32[2];
            x = (x | x << 4) & MORTON_MASK2D_U32[3];
            x = (x | x << 2) & MORTON_MASK2D_U32[4];
            x = (x | x << 1) & MORTON_MASK2D_U32[5];
// 2D decode 64-bit:
            x = x & MORTON_MASK2D_U64[5];
            x = (x ^ (x >> 1)) & MORTON_MASK2D_U64[4];
            x = (x ^ (x >> 2)) & MORTON_MASK2D_U64[3];
            x = (x ^ (x >> 4)) & MORTON_MASK2D_U64[2];
            x = (x ^ (x >> 8)) & MORTON_MASK2D_U64[1];
            x = (x ^ (x >> 16)) & MORTON_MASK2D_U64[0];
// 2D encode 64-bit:
            x = (x | x << 32) & MORTON_MASK2D_U64[0];
            x = (x | x << 16) & MORTON_MASK2D_U64[1];
            x = (x | x << 8) & MORTON_MASK2D_U64[2];
            x = (x | x << 4) & MORTON_MASK2D_U64[3];
            x = (x | x << 2) & MORTON_MASK2D_U64[4];
            x = (x | x << 1) & MORTON_MASK2D_U64[5];

to make it pass this test. I am still having trouble with the 3D version though...

Just wanted to ping you on the issue that the magic bits implementation might not be 100% correct. It seems to work for small enough morton codes, but if you only need those you should probably be using u8 or u16 specific implementations.

A good test is just to test it against the BMI2 version for the full u8, u16, and u32 integer ranges, and for some subranges of the u64 integer range (testing the whole u64 range takes N * 100 years on a 3Ghz CPU).

If it turns out that this is not a bug in my code, but actually a problem with libmorton, you might want to repeat the benchmarks since if more operations are needed for correctness maybe BMI2 will compare better.

morton3D_Encode_LUT256_shifted_ET 'total' was not defined

At line 74 in morton3D.h there is:
return total | answer_z | answer_y | answer_x;

but 'total' doesn't seem to be defined anywhere in this scope.

declaration of 'const morton morton' shadows template parameter

In morton2D.h at line 98 there is:

template<typename morton, typename coord>
inline void morton2D_Decode_for(const morton morton, coord& x, coord& y) {

The first argument cannot be named 'morton' because the template parameter is already named 'morton'.

MIT/BSD/ZLIB/something dual license?

Hey, I was very interested in using this in a small open source 2D/3D engine to speed up collision detection, but I unfortunately am targeting a BSD style license. Is there any way we could get a dual license version of this?

Thank you!