
gvinciguerra / pgm-index


🏅State-of-the-art learned data structure that enables fast lookup, predecessor, range searches and updates in arrays of billions of items using orders of magnitude less space than traditional indexes

Home Page: https://pgm.di.unipi.it/

License: Apache License 2.0

CMake 0.13% C++ 99.42% C 0.20% Jupyter Notebook 0.24%
data-structures database cpp research indexing big-data compression succinct-data-structure b-tree machine-learning

pgm-index's People

Contributors

bftjoe, gvinciguerra, lgtm-migrator, mrsndmn, romaa2000, ryanmarcus, shiyunya, tomatolog, yangzq50


pgm-index's Issues

noexcept

Please add an option to make this library not throw exceptions. Throwing exceptions forces callers either to change their code or to enable exceptions, which can impact performance.

Logic error: Points must be increasing by x.

Hello! Got another one for you :)

Dataset: https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/JGVF9A/EATHF7

PGMIndex<uint64_t, 16, 4> and PGMIndex<uint64_t, 32, 4> both build without an issue, but PGMIndex<uint64_t, 64, 4> fails:

terminate called after throwing an instance of 'std::logic_error'                                                  
  what():  Points must be increasing by x.                                                                                                           

Backtrace:

(gdb) bt                                                                                                           
#0  0x00007ffff7a87ce5 in raise () from /usr/lib/libc.so.6                                                         
#1  0x00007ffff7a71857 in abort () from /usr/lib/libc.so.6                                                         
#2  0x00007ffff7e2c81d in __gnu_cxx::__verbose_terminate_handler ()                                                
    at /build/gcc/src/gcc/libstdc++-v3/libsupc++/vterminate.cc:95                                                  
#3  0x00007ffff7e392ca in __cxxabiv1::__terminate (handler=<optimized out>)                                        
    at /build/gcc/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:48                                                
#4  0x00007ffff7e39337 in std::terminate () at /build/gcc/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:58        
#5  0x00007ffff7e3959e in __cxxabiv1::__cxa_throw (obj=obj@entry=0x555555b29420, 
    tinfo=0x555555b04098 <typeinfo for std::logic_error@@GLIBCXX_3.4>, 
    dest=0x7ffff7e4f630 <std::logic_error::~logic_error()>)
    at /build/gcc/src/gcc/libstdc++-v3/libsupc++/eh_throw.cc:95
#6  0x00005555557ad18e in OptimalPiecewiseLinearModel<unsigned long, unsigned long>::add_point (
    this=this@entry=0x7fffffffd9b0, x=@0x7fffffffd760: 0, y=@0x7fffffffd768: 8942)
    at /usr/include/c++/9.3.0/bits/stl_iterator.h:806
#7  0x00005555557bdeeb in make_segmentation<PGMIndex<unsigned long, 64ul, 4ul, double>::PGMIndex<__gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > > >(__gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > >, __gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > >)::{lambda(auto:1)#4}, PGMIndex<unsigned long, 64ul, 4ul, double>::PGMIndex<__gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > > >(__gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > >, __gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > >)::{lambda(auto:1, auto:2, auto:3)#3}>(unsigned long, unsigned long, PGMIndex<unsigned long, 64ul, 4ul, double>::PGMIndex<__gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > > >(__gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > >, __gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > >)::{lambda(auto:1)#4}, PGMIndex<unsigned long, 64ul, 4ul, double>::PGMIndex<__gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > > >(__gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > >, __gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > >)::{lambda(auto:1)#4}) (n=n@entry=8943, error=error@entry=4, in=..., out=..., out@entry=...)
    at /usr/include/c++/9.3.0/bits/stl_pair.h:378
#8  0x00005555557bee24 in PGMIndex<unsigned long, 64ul, 4ul, double>::PGMIndex<__gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > > > (last=0, first=..., 
    this=0x555555b2dea0) at /usr/include/c++/9.3.0/bits/stl_vector.h:1040
#9  PGMIndex<unsigned long, 64ul, 4ul, double>::PGMIndex (
    data=std::vector of length 200000000, capacity 268435456 = {...}, this=0x555555b2dea0)
    at /home/ryan/SOSD-private/competitors/PGM-index/include/pgm_index.hpp:113
#10 std::make_unique<PGMIndex<unsigned long, 64ul, 4ul, double>, std::vector<unsigned long, std::allocator<unsigned
 long> >&> () at /usr/include/c++/9.3.0/bits/unique_ptr.h:857
#11 PGM<unsigned long, 64>::Build (data=..., this=0x7fffffffdc30)
    at /home/ryan/SOSD-private/dtl/../competitors/pgm_index.h:23

Postgres?

This sounds exciting. Is it going to be implemented in a DB like Postgres soon?

Clarification on Error Bound Ranges for Present and Absent Keys

Hello there! I want to express my admiration for the excellent thesis you've put together. Your work is truly insightful and contributes greatly to the field. Also, I appreciate the kind and helpful response you provided to my previous question. I hope you won't mind if I ask for your guidance once more.

I've been looking into the code and noticed something interesting. When a key is present, the error bound seems to be defined as [p-e, p+e+2). On the other hand, if the key isn't present, the error bound appears to be guaranteed as [p-e, p+e+3).

You mentioned in your comments that the range is guaranteed to contain a key that is not less than the lookup key if the key is not present. However, I was wondering whether it might actually be guaranteed to contain a key that is less than the lookup key.

I hope you don't mind me seeking some clarification on this topic. Would it be more accurate to define the error bound range (inclusive of the start, exclusive of the end) as [p-e, p+e+3) regardless of whether the key exists? I'd really appreciate your thoughts on this.

Additionally, I've included a modified version of your example code below to help illustrate my question.

I hope you'll find my observations helpful, and if my findings do turn out to be accurate, would you be open to receiving a pull request for your repository? I'd be more than happy to contribute in a way that respects your work and intentions. Please let me know if this would be acceptable to you, and once again, thank you for your valuable insights and assistance.

[Test Code]

#include <vector>
#include <cstdlib>
#include <iostream>
#include <algorithm>
#include <random>
#include <unordered_set>
#include <cstdint>
#include <limits>
#include "pgm/pgm_index.hpp"

template <class KeyType>
std::vector<KeyType> CreateUniqueRandomKeys(size_t seed, size_t num_keys) {
  std::unordered_set<KeyType> keys;
  keys.reserve(num_keys);
  std::mt19937 g(seed);
  std::uniform_int_distribution<KeyType> d(std::numeric_limits<KeyType>::min(),
                                           std::numeric_limits<KeyType>::max());
  while (keys.size() < num_keys) keys.insert(d(g));
  std::vector<KeyType> sorted_keys(keys.begin(), keys.end());
  std::sort(sorted_keys.begin(), sorted_keys.end());
  return sorted_keys;
}

int main() {
    const auto keys = CreateUniqueRandomKeys<uint64_t>(42, 1000);
    const auto lookup_keys = CreateUniqueRandomKeys<uint64_t>(815, 1000);

    // Construct the PGM-index
    const int epsilon = 4; // space-time trade-off parameter
    pgm::PGMIndex<uint64_t, epsilon> index(keys);

    for (const auto& key : lookup_keys) {
        pgm::ApproxPos bound = index.search(key);
        const auto pgm_it = std::lower_bound(keys.begin() + bound.lo, keys.begin() + bound.hi, key);
        const auto std_it = std::lower_bound(keys.begin(), keys.end(), key);

        // Check if key is not present.
        if (std_it == keys.end() || *std_it != key){
            // Check if lowerbound of lookup key is p+e+2.
            if ((pgm_it == keys.begin() + bound.hi) && (std_it != keys.end())){
                std::cout << "[p+e+1 key]  value: " << *(keys.begin() + bound.hi - 1) << " lower_bound pos: " << bound.hi - 1 << std::endl;
                std::cout << "[lookup key] value: " << key << " lower_bound pos: " << std_it - keys.begin()<< std::endl;
                std::cout << "[p+e+2 key]  value: " << *(keys.begin() + bound.hi) << " lower_bound pos: " << bound.hi<< std::endl << std::endl;
            }
        }
    }
    return 0;
}

[Result]

g++ examples/simple.cpp -std=c++17 -I./include -o simple -fopenmp
./simple
[p+e+1 key]  value: 8746963316584524687 lower_bound pos: 484
[lookup key] value: 8747272492531411863 lower_bound pos: 485
[p+e+2 key]  value: 8781579887036780598 lower_bound pos: 485

[p+e+1 key]  value: 9697272242467760156 lower_bound pos: 535
[lookup key] value: 9699555025101314499 lower_bound pos: 536
[p+e+2 key]  value: 9734367273811274964 lower_bound pos: 536

multi-level PGM-index and its size

@gvinciguerra
Hello, I have done an experiment building a multi-level PGM-index.
Why? I use it to support very large integers, such as a 256-bit integer.
Assume I have a uint256_t num; let num[0] be a uint128_t holding the high 128 bits of num, and num[1] a uint128_t holding the low 128 bits.
To index with the PGM-index:

typedef pgm::DynamicPGMIndex<__uint128_t, Value> type1;
typedef pgm::DynamicPGMIndex<__uint128_t, type1 *> type2;

For insertion:
I first look up num[0] in type2; if absent, I create a new type1, insert it, and then insert the real value into the corresponding type1, such as:

type2 index2nd;
auto pp = index2nd.find(num[0]);
if (pp == index2nd.end()) {
    type1 *pp2 = new type1();
    index2nd.insert_or_assign(num[0], pp2);
    pp2->insert_or_assign(num[1], value);
}
else {
    pp->second->insert_or_assign(num[1], value);
}

To calculate the size of the whole index:

uint64_t size = index2nd.size_in_bytes();
for (auto &e : index2nd) {
      size += e.second->size_in_bytes();
}

Does size correctly calculate the memory size of index2nd?

MacOS aarch64 compile issue

Apple clang version 12.0.0 (clang-1200.0.32.28)

I'm guessing this is an x86 dependency that should be #ifdef-guarded? I'll take a look tomorrow.

PGM-index/include/pgm/piecewise_linear_model.hpp:27:9: warning: Compilation with -fopenmp is optional but recommended [-W#pragma-messages]
#pragma message ("Compilation with -fopenmp is optional but recommended")
^
PGM-index/tuner/tuner.cpp:25:
PGM-index/tuner/tuner.hpp:176:34: error: invalid output constraint '=a' in asm
asm volatile("cpuid":"=a"(_), "=b"(_), "=c"(c), "=d"(_):"a"(0x80000006));
^
1 warning and 1 error generated.

Questions about Error Bound

I was very impressed with your paper.
I don't know if I understood the research paper well, but I ask because the error bounds are quite different from what I understood.

What is the inclusive search error bound for a positive search?
I thought the search error bound for a positive search is [p-e, p+e], but it seems to be [p-e, p+e+1].
Therefore the bottom segments seem to return a search range satisfying "key_p-e <= positive lookup_key <= key_p+e+1".

According to your code, the search range for a bottom segment is [p-e, p+e+2) and the search range for an upper (non-bottom) segment is (p-e-1, p+e+2). Is this because the PGM-index has to select a segment with a key less than the lookup key at the upper levels, while the optimal PLR returns a search range that is not less than an absent key?

What is the search error bound for a negative search (searching for a missing key) needed to get the same result as std::lower_bound()?

Partway through, I accidentally submitted the question before I finished writing it.
Sorry for so many questions, but I really want to understand your research well.
Thank you.

conflicts between sdsl and CompressPGM

Hello, I have encountered an engineering issue.

It is a version conflict between sdsl-lite and EliasFanoPGMIndex in pgm_index_variants.hpp.

==============error====================
error: ‘const rank_1_type’ {aka ‘const class sdsl::rank_support_sd<1, sdsl::int_vector<1>, sdsl::select_support_mcl<1, 1>, sdsl::select_support_mcl<0, 1> >’} has no member named ‘pred’
  566 |         auto[r, origin] = rank.pred(k - first_key);
=====================================

I guess this is caused by the current sdsl-lite not providing rank.pred.

../PGM-index/include/pgm/pgm_index_variants.hpp:570:28: error: could not convert ‘{pos, lo, hi}’ from ‘<brace-enclosed initializer list>’ to ‘pgm::ApproxPos’
  570 |         return {pos, lo, hi};

I compiled sdsl-lite and the example code with -std=c++17, but the error still appears (very strange).

Scalability issue

Hello @gvinciguerra. I have done a comparison between the PGM-index and unordered_map (a hash map) using:

pgm::DynamicPGMIndex<__uint128_t, A> d_index; // A is a structure with multiple uint64_t
for(__uint128_t i=0; i<0xFFFFFFFFFFFFFFFF; i++){
	A a;
	a.a = (uint64_t)i;
	a.b = (uint64_t)2*i;
	a.c = (uint64_t)3*i;
	d_index.insert_or_assign(i, a);
}
A b;
A *c = &b;
for(__uint128_t i=0; i<0xFFFFFFFFFFFFFFFF; i++){
	auto iter = d_index.find(i);
	*c = iter->second;
	std::cout << b.a << " " << b.b << " " << b.c << std::endl;
}

The memory consumption of DynamicPGMIndex reaches 98% and it causes:

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted

while the case using unordered_map has only very low memory consumption.
Theoretically speaking, the hash map should use much more memory (pointers in linked lists, RB trees, etc.).
Why does this experiment show DynamicPGMIndex using more memory? And how can we solve it?

Another issue is mentioned in #17: how can we build a dynamically insertable set using the PGM-index?

Many thanks

Lo > high and a few more valgrind errors

Hello, got another one for you to squash. :)

I was trying to run the PGM index on 32-bit synthetic lognormal data and encountered a scenario where (after some UB) the PGM index predicts a lower bound higher than the upper bound.

Dataset download: http://cs.brandeis.edu/~rcmarcus/lognormal_200M_uint32.xz

Again, the dataset format is a 64-bit unsigned integer with the number of entries (200M), and then the data (as 32-bit unsigned integers).

Here's the driver program I used (same as last time, but with uint32_t).

#include <vector>
#include <cstdlib>
#include <iostream>
#include <algorithm>
#include <iterator>
#include <cstdint>
#include <fstream>
#include <string>
#include "pgm_index.hpp"

static std::vector<uint32_t> load_data(const std::string& filename) {
  std::vector<uint32_t> data;

  std::ifstream in(filename, std::ios::binary);
  if (!in.is_open()) {
      std::cerr << "unable to open " << filename << std::endl;
      exit(EXIT_FAILURE);
  }
  
  // Read size.
  uint64_t size;
  in.read(reinterpret_cast<char*>(&size), sizeof(uint64_t));
  data.resize(size);
  
  // Read values.
  in.read(reinterpret_cast<char*>(data.data()), size*sizeof(uint32_t));
  in.close();

  return data;
}


int main(int argc, char **argv) {
    // Generate some random data
    std::vector<uint32_t> dataset = load_data("lognormal_200M_uint32");
    std::cout << "Dataset is loaded." << std::endl;

    // Construct the PGM-index
    const int error = 128;
    PGMIndex<uint32_t, error> index(dataset);

    std::cout << "PGM index constructed." << std::endl;
    
    // Query the PGM-index
    uint32_t q = 1859517629;
    auto approx_range = index.find_approximate_position(q);
    std::cout << approx_range.lo << " to " << approx_range.hi << "\n";
        
    auto lo = dataset.begin() + approx_range.lo;
    auto hi = dataset.begin() + approx_range.hi;

    auto pos = std::distance(dataset.begin(), std::lower_bound(lo, hi, q));
    std::cout << "search returned:  " << pos << "\n";

    auto corr_pos = std::distance(dataset.begin(), std::lower_bound(dataset.begin(), dataset.end(), q));
    std::cout << "correct position: " << corr_pos << "\n";

    return 0;
}

The resulting output:

Dataset is loaded.
PGM index constructed.
2173840071 to 199999999
search returned:  2173840071
correct position: 199999901

... and the valgrind output:

==22833== Memcheck, a memory error detector
==22833== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==22833== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==22833== Command: ./pgm
==22833== 
==22833== Warning: set address range perms: large range [0x519c040, 0x34c8c840) (undefined)
==22833== Warning: set address range perms: large range [0x519e037, 0x34c8c840) (defined)
Dataset is loaded.
PGM index constructed.
==22833== Invalid read of size 4
==22833==    at 0x10B498: find_segment_for_key (pgm_index.hpp:261)
==22833==    by 0x10B498: find_approximate_position (pgm_index.hpp:557)
==22833==    by 0x10B498: main (main.cpp:43)
==22833==  Address 0x519bfe0 is 12 bytes after a block of size 20 alloc'd
==22833==    at 0x4838DEF: operator new(unsigned long) (vg_replace_malloc.c:344)
==22833==    by 0x10D8CB: allocate (new_allocator.h:114)
==22833==    by 0x10D8CB: allocate (alloc_traits.h:444)
==22833==    by 0x10D8CB: _M_allocate (stl_vector.h:343)
==22833==    by 0x10D8CB: std::vector<unsigned int, std::allocator<unsigned int> >::reserve(unsigned long) (vector.tcc:78)
==22833==    by 0x10EA5B: Layer<Segmentation<unsigned int, 16, double> > (pgm_index.hpp:197)
==22833==    by 0x10EA5B: construct<RecursiveStrategy<Segmentation<unsigned int, 128, double>, 16>::Layer, Segmentation<unsigned int, 16, double>&> (new_allocator.h:147)
==22833==    by 0x10EA5B: construct<RecursiveStrategy<Segmentation<unsigned int, 128, double>, 16>::Layer, Segmentation<unsigned int, 16, double>&> (alloc_traits.h:484)
==22833==    by 0x10EA5B: _M_create_node<Segmentation<unsigned int, 16, double>&> (stl_list.h:633)
==22833==    by 0x10EA5B: _M_insert<Segmentation<unsigned int, 16, double>&> (stl_list.h:1907)
==22833==    by 0x10EA5B: emplace_front<Segmentation<unsigned int, 16, double>&> (stl_list.h:1173)
==22833==    by 0x10EA5B: RecursiveStrategy<Segmentation<unsigned int, 128ul, double>, 16ul>::RecursiveStrategy(std::vector<unsigned int, std::allocator<unsigned int> > const&, Segmentation<unsigned int, 128ul, double> const&) (pgm_index.hpp:223)
==22833==    by 0x10AFF9: PGMIndex (pgm_index.hpp:529)
==22833==    by 0x10AFF9: main (main.cpp:37)
==22833== 
==22833== Invalid read of size 8
==22833==    at 0x10B1A1: operator()<unsigned int> (pgm_index.hpp:58)
==22833==    by 0x10B1A1: find_segment_for_key (pgm_index.hpp:266)
==22833==    by 0x10B1A1: find_approximate_position (pgm_index.hpp:557)
==22833==    by 0x10B1A1: main (main.cpp:43)
==22833==  Address 0x36fe8150 is 128 bytes inside a block of size 144 in arena "client"
==22833== 
==22833== Invalid read of size 8
==22833==    at 0x10B1A5: operator()<unsigned int> (pgm_index.hpp:58)
==22833==    by 0x10B1A5: find_segment_for_key (pgm_index.hpp:266)
==22833==    by 0x10B1A5: find_approximate_position (pgm_index.hpp:557)
==22833==    by 0x10B1A5: main (main.cpp:43)
==22833==  Address 0x36fe8158 is 136 bytes inside a block of size 144 in arena "client"
==22833== 
2173840071 to 199999999
search returned:  2173840071
correct position: 199999901
==22833== Warning: set address range perms: large range [0x519c028, 0x34c8c858) (noaccess)
==22833== 
==22833== HEAP SUMMARY:
==22833==     in use at exit: 0 bytes in 0 blocks
==22833==   total heap usage: 25,794,944 allocs, 25,794,944 frees, 2,038,844,430 bytes allocated
==22833== 
==22833== All heap blocks were freed -- no leaks are possible
==22833== 
==22833== For lists of detected and suppressed errors, rerun with: -s
==22833== ERROR SUMMARY: 3 errors from 3 contexts (suppressed: 0 from 0)

Do I need to do anything special to make the PGM index work with 32-bit integers, besides changing the template argument?

Some keys cannot be found if pgm-index is built in single-thread

While benchmarking the PGM-Index, I found that the key location could not be found when building with the make_segmentation function instead of the make_segmentation_par function. This issue also occurs in other datasets, but it is most frequent in the eth dataset provided by the GRE benchmark (https://github.com/gre4index/GRE). I've uploaded the test code below, so please check it out.

  1. Download PGM-Index and eth dataset.
git clone https://github.com/gvinciguerra/PGM-index.git
cd PGM-index
wget --no-check-certificate -nc https://www.cse.cuhk.edu.hk/mlsys/gre/eth
  2. Modify the code to build the PGM-Index in a single thread.
// file: PGM-index/include/pgm/pgm_index.hpp

    template<typename RandomIt>
    static void build(RandomIt first, RandomIt last,
                      size_t epsilon, size_t epsilon_recursive,
                      std::vector<Segment> &segments,
                      std::vector<size_t> &levels_offsets) {
        ...
        auto build_level = [&](auto epsilon, auto in_fun, auto out_fun) {
            // auto n_segments = internal::make_segmentation_par(last_n, epsilon, in_fun, out_fun);
            auto n_segments = internal::make_segmentation(last_n, epsilon, in_fun, out_fun);
       ...
  3. Change the test code
#include "catch.hpp"
#include "pgm/morton_nd.hpp"
#include "pgm/pgm_index.hpp"
#include "pgm/pgm_index_dynamic.hpp"
#include "pgm/pgm_index_variants.hpp"
#include "pgm/piecewise_linear_model.hpp"
#include "utils.hpp"

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <cmath>
#include <fstream>
#include <functional>
#include <iostream>
#include <iterator>
#include <limits>
#include <map>
#include <random>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Loads values from binary file into vector.
static std::vector<uint64_t> load_data(const std::string& filename,
                                bool print = true) {
    std::vector<uint64_t> data;

    std::ifstream in(filename, std::ios::binary);
    if (!in.is_open()) {
        std::cerr << "unable to open " << filename << std::endl;
        exit(EXIT_FAILURE);
    }
    // Read size.
    uint64_t size;
    in.read(reinterpret_cast<char*>(&size), sizeof(uint64_t));
    data.resize(size);
    // Read values.
    in.read(reinterpret_cast<char*>(data.data()), size * sizeof(uint64_t));
    in.close();
    return data;
}

template <typename Index, typename Data>
void test_index(const Index &index, const Data &data) {
    auto rand = std::bind(std::uniform_int_distribution<size_t>(0, data.size() - 1), std::mt19937{42});

    for (auto i = 1; i <= 10000; ++i) {
        auto q = data[rand()];
        auto range = index.search(q);
        auto lo = data.begin() + range.lo;
        auto hi = data.begin() + range.hi;
        auto k = std::lower_bound(lo, hi, q);
        REQUIRE(*k == q);
    }

    // Test elements outside range
    auto q = data.back() + 42;
    auto range = index.search(q);
    auto lo = data.begin() + range.lo;
    auto hi = data.begin() + range.hi;
    REQUIRE(std::lower_bound(lo, hi, q) == data.end());

    q = std::numeric_limits<typename Data::value_type>::min();
    range = index.search(q);
    lo = data.begin() + range.lo;
    hi = data.begin() + range.hi;
    REQUIRE(std::lower_bound(lo, hi, q) == data.begin());
}

TEMPLATE_TEST_CASE_SIG("PGM-index", "",
                       ((typename T, size_t E1, size_t E2), T, E1, E2),
                       (uint64_t, 8, 4), (uint64_t, 32, 4), (uint64_t, 128, 4),
                       (uint64_t, 256, 256), (uint64_t, 512, 512)) {
    auto data = load_data("./eth");
    pgm::PGMIndex<T, E1, E2> index(data.begin(), data.end());
    test_index(index, data);
}
  4. Build & Run
cmake . -DCMAKE_BUILD_TYPE=Release
make -j8
./test/tests
  5. Test result
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
tests is a Catch v2.13.10 host application.
Run with -? for options

-------------------------------------------------------------------------------
PGM-index - uint64_t, 8, 4
-------------------------------------------------------------------------------
/home/PGM-index/test/tests.cpp:85
...............................................................................

/home/PGM-index/test/tests.cpp:68: FAILED:
  REQUIRE( *k == q )
with expansion:
  99249299 (0x5ea6c93)
  ==
  99249300 (0x5ea6c94)

-------------------------------------------------------------------------------
PGM-index - uint64_t, 32, 4
-------------------------------------------------------------------------------
/home/PGM-index/test/tests.cpp:85
...............................................................................

/home/PGM-index/test/tests.cpp:68: FAILED:
  REQUIRE( *k == q )
with expansion:
  99825789 (0x5f3387d)
  ==
  99825790 (0x5f3387e)

-------------------------------------------------------------------------------
PGM-index - uint64_t, 128, 4
-------------------------------------------------------------------------------
/home/PGM-index/test/tests.cpp:85
...............................................................................

/home/PGM-index/test/tests.cpp:68: FAILED:
  REQUIRE( *k == q )
with expansion:
  100297412 (0x5fa6ac4)
  ==
  100297414 (0x5fa6ac6)

===============================================================================
test cases:     5 |     2 passed | 3 failed
assertions: 21480 | 21477 passed | 3 failed

Thank you.

Embed into other languages

Hey, thanks for the library. Is there any possibility of a C version of this? It would then be much easier to embed in other languages. A C wrapper would also be a good compromise.

BUG: Wrong search result range for a special test case

The following GoogleTest code will not pass:

class TestPGM : public BaseTest {};

using u32 = uint32_t;
using i32 = int32_t;

constexpr u32 TestNum = 100;

TEST_F(TestPGM, TestPGMI32) {
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<i32> distrib(std::numeric_limits<i32>::lowest(), std::numeric_limits<i32>::max());

    std::vector<i32> data(TestNum);
    std::generate(data.begin(), data.end(), [&] { return distrib(gen); });
    std::sort(data.begin(), data.end());
    PGMIndex<i32> pgm(data);
    for (u32 i = 0; i < TestNum; ++i) {
        auto [pos, lo, hi] = pgm.search(data[i]);
        EXPECT_LE(lo, i);
        EXPECT_GE(hi, i);
    }
}

Could this bug be related to issue #49?

If I convert data to int64_t, all the search results are correct.
For example, the following GoogleTest code will pass:

class TestPGM : public BaseTest {};

using u32 = uint32_t;
using i32 = int32_t;
using i64 = int64_t;

constexpr u32 TestNum = 100;

TEST_F(TestPGM, TestPGMI32) {
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<i32> distrib(std::numeric_limits<i32>::lowest(), std::numeric_limits<i32>::max());

    std::vector<i64> data(TestNum);
    std::generate(data.begin(), data.end(), [&] { return distrib(gen); });
    std::sort(data.begin(), data.end());
    PGMIndex<i64> pgm(data);
    for (u32 i = 0; i < TestNum; ++i) {
        auto [pos, lo, hi] = pgm.search(data[i]);
        EXPECT_LE(lo, i);
        EXPECT_GE(hi, i);
    }
}

Recommendations for large list of strings?

I tried to convert my 125 million domains to a unique set of integers, but the integer values exceeded the max for 64-bit ints. Does anyone know a way to solve this? Maybe something obvious I am not seeing.

Add support for a vector of int as a key

Hello! @gvinciguerra
With growing big data, sometimes we may need a vector of integers as the key to index some values.
Currently the supported approach is to use a hash function to combine such a vector of integers into a uint128_t number.

Is it possible to extend the built-in ML model (f(x) = kx + b)?
For example, for [x1, x2, x3, ..., xn], build an ML model f(x1, x2, x3, ..., xn) = c0 + c1*x1 + c2*x2 + ... + cn*xn?

In this way, it could become more general. Currently I use a multi-level PGM-index (#26) to implement this, but it seems to have much more overhead (size, time, etc.) than a single level.

Many thanks.

[Compared with B+tree] operation latency issue

Dear @gvinciguerra
We compared the PGM-Index with the tlx B+tree implementation. We found that the index size is greatly reduced, but the latency is not as good as discussed in the paper. Are there any suggestions regarding our implementation?

// pgm
template<typename KeyType, typename ValueType>
void pgm_tests(KeyType *keys, ValueType *values, uint64_t number) {
	long time_cost;
	struct timeval start, end;
	typedef pgm::DynamicPGMIndex<KeyType, ValueType> PGMFP;
	PGMFP *pgm_index = new PGMFP();
	gettimeofday(&start, NULL);
	for (uint64_t i = 0; i < number; i++) {
		pgm_index->insert_or_assign(keys[i], values[i]);
	}
	gettimeofday(&end, NULL);
	time_cost = 1000000 * (end.tv_sec - start.tv_sec)
			+ (end.tv_usec - start.tv_usec);
	uint64_t size = pgm_index->size_in_bytes();
	printf("PGM: insert done, size %lu, time cost %ld\n", size, time_cost);
	start = end;
	for (uint64_t i = 0; i < number; i++) {
		auto iter = pgm_index->find(keys[i]);
		if (iter->second != values[i]) {
			printf("error %lu\n", i);
			return;
		}
	}
	gettimeofday(&end, NULL);
	time_cost = 1000000 * (end.tv_sec - start.tv_sec)
			+ (end.tv_usec - start.tv_usec);
	printf("PGM: success %lu, size %lu, time cost %ld\n", number, size,
			time_cost);
}

// btree
template<typename KeyType, typename ValueType>
void bptree_tests(KeyType *keys, ValueType *values, uint64_t number) {
	long time_cost;
	struct timeval start, end;
	typedef tlx::btree_map<KeyType, ValueType> bptree;
	bptree *bpt = new bptree();
	gettimeofday(&start, NULL);
	for (uint64_t i = 0; i < number; i++) {
		bpt->insert2(keys[i], values[i]);
	}
	gettimeofday(&end, NULL);
	time_cost = 1000000 * (end.tv_sec - start.tv_sec)
			+ (end.tv_usec - start.tv_usec);
	uint64_t size = bpt->get_stats().memory();
	printf("B+ tree: insert done, size %lu, time cost %ld\n", size, time_cost);
	start = end;
	for (uint64_t i = 0; i < number; i++) {
		auto iter = bpt->find(keys[i]);
		if (iter->second != values[i]) {
			printf("error %lu\n", i);
			return;
		}
	}
	gettimeofday(&end, NULL);
	time_cost = 1000000 * (end.tv_sec - start.tv_sec)
			+ (end.tv_usec - start.tv_usec);
	printf("B+ tree: success %lu, size %lu, time cost %ld\n", number, size,
			time_cost);
}

We store 2^24 key-value pairs of three key types (uint32_t, uint64_t, __uint128_t), and the results are:

     2.1 uint32_t:
        PGM: insert done, size 134218700, time cost 45030241
        PGM: success 16777216, size 134218700, time cost 84936539

        B+ tree: insert done, size 321282456, time cost 11842897
        B+ tree: success 16777216, size 321282456, time cost 8484778

     2.2 uint64_t:
        PGM: insert done, size 268436472, time cost 43116521
        PGM: success 16777216, size 268436472, time cost 81205265

        B+ tree: insert done, size 658504360, time cost 11806067
        B+ tree: success 16777216, size 658504360, time cost 8393439

    2.3 uint128_t:
        PGM: insert done, size 536872016, time cost 41139400
        PGM: success 16777216, size 536872016, time cost 82998800

        B+ tree: insert done, size 1436128368, time cost 16172490
        B+ tree: success 16777216, size 1436128368, time cost 9914490

PGMIndex returns incorrect range on books dataset

Hi,

For some values of epsilon, PGM returns incorrect ranges on the books dataset from SOSD.

Dataset: https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/JGVF9A/A6HDNT&version=10.0

Code to reproduce the issue:

#include <vector>
#include <fstream>
#include <iostream>

#include "pgm/pgm_index.hpp"


template<typename Key>
std::vector<Key> load_data(const std::string &filename) {
    using key_type = Key;

    /* Open file. */
    std::ifstream in(filename, std::ios::binary);
    if (!in.is_open())
        exit(EXIT_FAILURE);

    /* Read number of keys. */
    uint64_t n_keys;
    in.read(reinterpret_cast<char*>(&n_keys), sizeof(uint64_t));

    /* Initialize vector. */
    std::vector<key_type> data;
    data.resize(n_keys);

    /* Read keys. */
    in.read(reinterpret_cast<char*>(data.data()), n_keys * sizeof(key_type));
    in.close();

    return data;
}

int main(int argc, char *argv[])
{
    /* Load keys. */
    auto keys = load_data<uint64_t>("data/books_200M_uint64");
    std::cout << "keys loaded" << std::endl;

    /* Build PGM. */
    const int epsilon = 2048;
    pgm::PGMIndex<uint64_t, epsilon> pgm(keys);
    std::cout << "pgm built" << std::endl;

    /* Lookup all keys. */
    for (std::size_t i = 0; i != keys.size(); ++i) {
        auto key = keys.at(i);
        auto range = pgm.search(key);
        if (range.lo > i or range.hi < i) {
            std::cout << "key not in range\n"
                      << "key: " << key << '\n'
                      << "pos: " << i << '\n'
                      << "lo: " << range.lo << '\n'
                      << "hi: " << range.hi << '\n'
                      << std::endl;
        }
    }
}

Output for epsilon=2048:

...
key not in range
key: 880841552717461696
pos: 47236183
lo: 47236184
hi: 47240282

key not in range
key: 880842439537799360
pos: 47236231
lo: 47236232
hi: 47240330

Output for epsilon=16384:

...
key not in range
key: 1154823330085936896
pos: 61887138
lo: 61887141
hi: 61919911

key not in range
key: 1154823338493908224
pos: 61887139
lo: 61887141
hi: 61919911

key not in range
key: 1154823341621615232
pos: 61887140
lo: 61887141
hi: 61919911

BUG: Cannot create PGM index in a special case

Please try the following code:

std::vector<uint32_t> data(3, std::numeric_limits<uint32_t>::max());
PGMIndex<uint32_t> pgm(data);

Exception "Points must be increasing by x." will be thrown.

This bug may be caused by integer arithmetic overflow.

The following code runs correctly:

std::vector<uint64_t> data(3, std::numeric_limits<uint32_t>::max());
PGMIndex<uint64_t> pgm(data);

There is also a similar bug case for int32_t type:

std::vector<int32_t> data(3, std::numeric_limits<int32_t>::max());
PGMIndex<int32_t> pgm(data);

[pgm_index_dynamic] alignment issue.

pgm_index_dynamic.hpp:696:39: runtime error: member call on misaligned address 0x7fff4c74ac09 for type 'struct basic_string', which requires 8 byte alignment.

It seems PGM stores values with 1-byte alignment when std::string is used as the value type in pgm_index_dynamic. Is there any solution to address this?

Application of PGM-index

Hi, this is interesting work! Thanks for your contribution.

But I have some questions:

  1. In the examples, the index learns a mapping from a key's value to its stored position. In a real application, the user wants to access a value by its key. How should values be organized alongside the PGM-index? In other words, many (key, value) pairs are stored in a large array sorted by key (a value might be a pointer or an offset, so keys and values have roughly fixed length). How can the PGM-index be used to search for a key and return the corresponding value? I believe it should find the position and then use that position to locate the value, but the PGM-index seems to need a significant upgrade to support this. Are there other solutions?

  2. In the paper, the claim that the index "works for any kind of keys that can be mapped to reals while preserving their order. Examples include integers, strings, etc." is very difficult to understand and hard to implement. My reasons are given below:

    • issue one: is the purpose of the mapping to map a key (string) to a real, then use the sorted reals to construct the PGM-index and find the offset? Why not map the key directly to its offset or value? If a mapping between keys and reals is acceptable, then a mapping between keys and offsets (values) should be acceptable too.
    • issue two: it matters how strings (the key type in many DBs) can be supported in the current system. A string can have unbounded length, so it should be very hard to bound the errors, especially when many short fixed-length strings are mixed with a few long ones. I also referred to #8; that hint is very helpful, but other issues arise: mapping strings directly to uint128_t at the bit level may solve issue one, but it means (1) the maximum string length is 24 (?) and (2) space is wasted: if some strings have 19 chars, they must still be mapped to a full uint128_t.

Question: about support for int8_t and int16_t type

Are int8_t and int16_t supposed to be valid PGM index key types?

The following GoogleTest code fails for int8_t and int16_t.
However, uint8_t and uint16_t seem to work.

constexpr uint32_t TestNum = 200;

template <std::integral T>
class TestPGM : public BaseTest {
public:
    std::array<T, TestNum> data;
    uint32_t cnt = 0;
    std::unique_ptr<PGMIndex<T>> pgm;

    void SetUp() override {
        std::random_device rd;
        std::mt19937 gen(rd());
        std::uniform_int_distribution<T> distrib(std::numeric_limits<T>::lowest(), std::numeric_limits<T>::max() - 1);
        std::generate(data.begin(), data.end(), [&] { return distrib(gen); });
        std::sort(data.begin(), data.end());
        cnt = std::unique(data.begin(), data.end()) - data.begin();
        pgm = std::make_unique<PGMIndex<T>>(data.begin(), data.begin() + cnt);
    }
    void TearDown() override {}
};

using TestPGMTypes = ::testing::Types<int8_t, int16_t, int32_t, int64_t, uint8_t, uint16_t, uint32_t, uint64_t>;

TYPED_TEST_SUITE(TestPGM, TestPGMTypes);

TYPED_TEST(TestPGM, TestPGM) {
    for (uint32_t i = 0; i < this->cnt; ++i) {
        auto [pos, lo, hi] = this->pgm->search(this->data[i]);
        EXPECT_LE(lo, i);
        EXPECT_GE(hi, i);
    }
}

[Bug] terminate called after throwing an instance of 'std::invalid_argument'

Hello, when I run the following code

typedef pgm::DynamicPGMIndex<uint16_t, uint16_t> PGMFP;
PGMFP *pgm_index = new PGMFP();
uint64_t number = 1 << 16;
for (uint64_t i = 0; i < number; i++) {
	keys[i] = i;
	values[i] = i + 1;
	pgm_index->insert_or_assign(keys[i], values[i]);
}

It has the following issue:

terminate called after throwing an instance of 'std::invalid_argument'
  what():  The specified value is reserved and cannot be used.

Program received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff78ab859 in __GI_abort () at abort.c:79
#2  0x00007ffff7b75911 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007ffff7b8138c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007ffff7b813f7 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007ffff7b816a9 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x0000555555575895 in pgm::DynamicPGMIndex<unsigned short, unsigned short, pgm::PGMIndex<unsigned short, 16ul, 4ul, float> >::ItemA::ItemA (this=0x7fffffffe224, key=@0x7ffff7249dfc: 65534, value=@0x7ffff726ac7c: 65535)
    at pgm/pgm_index_dynamic.hpp:689
#7  0x000055555556c24c in pgm::DynamicPGMIndex<unsigned short, unsigned short, pgm::PGMIndex<unsigned short, 16ul, 4ul, float> >::insert_or_assign (this=0x7ffff72091e0, key=@0x7ffff7249dfc: 65534, value=@0x7ffff726ac7c: 65535)

The runtime error occurs when executing pgm_index->insert_or_assign(65534, 65535).

Segfault on large dataset

Hello!

We encountered a segfault when training a PGMIndex<uint64_t, 16, 4> on a larger (800M keys) dataset: https://www.dropbox.com/s/j1d4ufn4fyb4po2/osm_cellids_800M_uint64.zst

Unfortunately, the trace isn't super useful:

#0  0x00007f5af123727e in clock_nanosleep@GLIBC_2.2.5 () from /usr/lib/libc.so.6
#1  0x00007f5af123cbf7 in nanosleep () from /usr/lib/libc.so.6
#2  0x00007f5af123cb2e in sleep () from /usr/lib/libc.so.6
#3  0x00007f5af16b26ca in debug_wait_gdb () at lib.c:268
#4  <signal handler called>
#5  0x000055eec8226950 in unsigned long make_segmentation<PGMIndex<unsigned long, 16ul, 4ul, double>::PGMIndex<__gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > > >(__gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > >, __gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > >)::{lambda(auto:1)#3}, PGMIndex<unsigned long, 16ul, 4ul, double>::PGMIndex<__gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > > >(__gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > >, __gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > >)::{lambda(auto:1, auto:2, auto:3)#2}>(unsigned long, unsigned long, PGMIndex<unsigned long, 16ul, 4ul, double>::PGMIndex<__gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > > >(__gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > >, __gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > >)::{lambda(auto:1)#3}, PGMIndex<unsigned long, 16ul, 4ul, double>::PGMIndex<__gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > > >(__gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > >, __gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > >)::{lambda(auto:1)#3}) ()
#6  0x000055eec8226ffc in PGMIndex<unsigned long, 16ul, 4ul, double>::PGMIndex<__gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > > >(__gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > >, __gnu_cxx::__normal_iterator<unsigned long const*, std::vector<unsigned long, std::allocator<unsigned long> > >) ()
#7  0x000055eec822744f in std::_Function_handler<void (), sosd::Benchmark<unsigned long>::Run<PGM<unsigned long, 16> >()::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
#8  0x000055eec81eae5e in util::timing(std::function<void ()>) ()

Thanks,
Ryan

[Segmentation fault] Access structure in DynamicPGMIndex

@gvinciguerra

Hello, while testing I found an issue with DynamicPGMIndex.

struct A {
    uint64_t a, b, c;
};

struct timeval start, end;
gettimeofday(&start, NULL);

pgm::DynamicPGMIndex<__uint128_t, A> d_index;
for (uint64_t i = 0; i < 0xFFFFF; i++) {
    A a;
    a.a = i;
    a.b = 2 * i;
    a.c = 3 * i;
    d_index.insert_or_assign(i, a);
    if (i % 0x1FFFF == 0) {
        gettimeofday(&end, NULL);
        long time_len = 1000000 * (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec);
        std::cout << (uint32_t)(i / 0x1FFFF) << " costs " << time_len << std::endl;
        start = end;
    }
}
for (uint64_t i = 0; i < 0xFFFFF; i++) {
    auto iter = d_index.find(i);
    A b = iter->second;
    if (b.a != i || b.b != 2 * i || b.c != 3 * i) {
        printf("failed\n");
        return;
    }
}
printf("success\n");

gdb shows the fault happens when executing A b = iter->second;. Does DynamicPGMIndex support a custom struct as its value type?

====================================
By the way, does the PGM-index support GC? Are there any suggestions for memory management when using the PGM-index? A hash map is indeed very slow when inserting a lot of data.

For example, in one case with the PGM-index, VIRT grows to 2.5T while RES is only 250G, causing an allocation failure; in the same case with unordered_map, both VIRT and RES stay around 200G and the execution finishes.

No install instructions in CMakeLists.txt + missing PGMConfig.cmake / pgm-config.cmake

Any plans to add proper installation instructions to CMakeLists.txt?

Also, when trying to build columnar, CMake returns the following error:

ninja: build stopped: subcommand failed.
CMake Error at cmake/GetPGM.cmake:39 (find_package):
  Could not find a package configuration file provided by "PGM" with any of
  the following names:

    PGMConfig.cmake
    pgm-config.cmake

  Add the installation prefix of "PGM" to CMAKE_PREFIX_PATH or set "PGM_DIR"
  to a directory containing one of the above files.  If "PGM" provides a
  separate development package or SDK, be sure it has been installed.

So, these files should also be installed when installing PGM-index.

Segfault when performing lookup on last key of fb dataset

Hi,
there appears to be a segfault when performing a lookup on the last key (18446744073709551615) of the fb dataset from SOSD. The problem occurs for several values of epsilon.

Dataset: https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/JGVF9A/EATHF7&version=10.0

Code to reproduce the issue:

#include <vector>
#include <fstream>
#include <iostream>

#include "pgm/pgm_index.hpp"


template<typename Key>
std::vector<Key> load_data(const std::string &filename) {
    using key_type = Key;

    /* Open file. */
    std::ifstream in(filename, std::ios::binary);
    if (!in.is_open())
        exit(EXIT_FAILURE);

    /* Read number of keys. */
    uint64_t n_keys;
    in.read(reinterpret_cast<char*>(&n_keys), sizeof(uint64_t));

    /* Initialize vector. */
    std::vector<key_type> data;
    data.resize(n_keys);

    /* Read keys. */
    in.read(reinterpret_cast<char*>(data.data()), n_keys * sizeof(key_type));
    in.close();

    return data;
}

int main(int argc, char *argv[])
{
    /* Load keys. */
    auto keys = load_data<uint64_t>("data/fb_200M_uint64");
    std::cout << "keys loaded" << std::endl;

    /* Build PGM. */
    const int epsilon = 16;
    pgm::PGMIndex<uint64_t, epsilon> pgm(keys);
    std::cout << "pgm built" << std::endl;

    /* Lookup last key. */
    auto key = keys.back();
    std::cout << "searching " << key << "..." << std::endl;
    auto range = pgm.search(key); // <- segfault happens here
    std::cout << "key is between " << range.lo << " and " << range.hi << std::endl;
}

Backtrace:

Program received signal SIGSEGV, Segmentation fault.
0x00005555555566a7 in pgm::PGMIndex<unsigned long, 16ul, 4ul, float>::segment_for_key (this=0x7fffffffdea8, key=<optimized out>)
    at ../../third_party/PGM-index/include/pgm/pgm_index.hpp:147
147	                for (; std::next(lo)->key <= key; ++lo)
(gdb) bt
#0  0x00005555555566a7 in pgm::PGMIndex<unsigned long, 16ul, 4ul, float>::segment_for_key (this=0x7fffffffdea8, key=<optimized out>)
    at ../../third_party/PGM-index/include/pgm/pgm_index.hpp:147
#1  pgm::PGMIndex<unsigned long, 16ul, 4ul, float>::search (this=0x7fffffffdea8, key=<optimized out>) at ../../third_party/PGM-index/include/pgm/pgm_index.hpp:194
#2  main (argc=<optimized out>, argv=<optimized out>) at ../../test.cpp:46

Out-of-range predictions and undefined behavior

Hello! Awesome work on the PGM index, this is really cool stuff! I appreciate you taking the time to release the code as well.

I tried using the PGM-index on a dataset of Facebook user IDs, and got a wrong result and found some undefined behavior with valgrind. It is possible that I'm doing something silly.

For reference, you can download the dataset here: https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/JGVF9A/Y54SI9&version=3.0

The dataset is formatted as densely packed unsigned 64-bit ints, with the first value being the number of items in the dataset (200M).

Here's the driver program I used to read the data and try out the PGM index:

#include <vector>
#include <cstdlib>
#include <iostream>
#include <algorithm>
#include <iterator>
#include "pgm_index.hpp"

static std::vector<uint64_t> load_data(const std::string& filename) {
  std::vector<uint64_t> data;

  std::ifstream in(filename, std::ios::binary);
  if (!in.is_open()) {
      std::cerr << "unable to open " << filename << std::endl;
      exit(EXIT_FAILURE);
  }
  
  // Read size.
  uint64_t size;
  in.read(reinterpret_cast<char*>(&size), sizeof(uint64_t));
  data.resize(size);
  
  // Read values.
  in.read(reinterpret_cast<char*>(data.data()), size*sizeof(uint64_t));
  in.close();

  return data;
}


int main(int argc, char **argv) {
    // Generate some random data
    std::vector<uint64_t> dataset = load_data("fb_200M_uint64");
    std::cout << "Dataset is loaded." << std::endl;

    // Construct the PGM-index
    const int error = 128;
    PGMIndex<uint64_t, error> index(dataset);

    std::cout << "PGM index constructed." << std::endl;
    
    // Query the PGM-index
    uint64_t q = 1040094788556900096;
    auto approx_range = index.find_approximate_position(q);
    std::cout << approx_range.lo << " to " << approx_range.hi << "\n";
        
    auto lo = dataset.begin() + approx_range.lo;
    auto hi = dataset.begin() + approx_range.hi;

    auto pos = std::distance(dataset.begin(), std::lower_bound(lo, hi, q));
    std::cout << "search returned:  " << pos << "\n";

    auto corr_pos = std::distance(dataset.begin(), std::lower_bound(dataset.begin(), dataset.end(), q));
    std::cout << "correct position: " << corr_pos << "\n";

    return 0;
}

The program outputs:

50766967 to 50767223
search returned:  50766967
correct position: 22806066

... indicating that the PGM index predicted the location of the search key far more than 128 positions from the correct position.

I added -g to the build flags and ran Valgrind. The PGM index is constructed without any errors, but then the lookup procedure seems to do several invalid reads.

[ryan@ryan-arch-tower pgm_test]$ valgrind ./pgm 
==728290== Memcheck, a memory error detector
==728290== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==728290== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==728290== Command: ./pgm
==728290== 
==728290== Warning: set address range perms: large range [0x59c99040, 0xb927a040) (undefined)
==728290== Warning: set address range perms: large range [0x59c9b037, 0xb927a040) (defined)
Dataset is loaded.
PGM index constructed.
==728290== Invalid read of size 8
==728290==    at 0x10B080: find_segment_for_key (pgm_index.hpp:260)
==728290==    by 0x10B080: find_approximate_position (pgm_index.hpp:556)
==728290==    by 0x10B080: main (main.cpp:43)
==728290==  Address 0x5502390 is 80 bytes inside an unallocated block of size 608 in arena "client"
==728290== 
==728290== Invalid read of size 8
==728290==    at 0x10AD41: operator()<long unsigned int> (pgm_index.hpp:58)
==728290==    by 0x10AD41: find_segment_for_key (pgm_index.hpp:265)
==728290==    by 0x10AD41: find_approximate_position (pgm_index.hpp:556)
==728290==    by 0x10AD41: main (main.cpp:43)
==728290==  Address 0x545d4a0 is 224 bytes inside an unallocated block of size 480 in arena "client"
==728290== 
==728290== Invalid read of size 8
==728290==    at 0x10AD47: operator()<long unsigned int> (pgm_index.hpp:58)
==728290==    by 0x10AD47: find_segment_for_key (pgm_index.hpp:265)
==728290==    by 0x10AD47: find_approximate_position (pgm_index.hpp:556)
==728290==    by 0x10AD47: main (main.cpp:43)
==728290==  Address 0x545d4a8 is 232 bytes inside an unallocated block of size 480 in arena "client"
==728290== 
50766967 to 50767223
search returned:  50766967
correct position: 22806066
==728290== Warning: set address range perms: large range [0x59c99028, 0xb927a058) (noaccess)
==728290== 
==728290== HEAP SUMMARY:
==728290==     in use at exit: 0 bytes in 0 blocks
==728290==   total heap usage: 399,438,306 allocs, 399,438,306 frees, 14,414,997,616 bytes allocated
==728290== 
==728290== All heap blocks were freed -- no leaks are possible
==728290== 
==728290== For lists of detected and suppressed errors, rerun with: -s
==728290== ERROR SUMMARY: 6 errors from 3 contexts (suppressed: 0 from 0)

The error at line 260 seems to be the linear scan code. I'm not sure how line 58 incurs an invalid read; I'm guessing some inlining happened and the actual error occurs at the array lookup on line 265.

Segmentation fault on facebook (SOSD) dataset

Hi! I am experiencing a segmentation fault when DynamicPGM is initialized with the whole FB dataset. (https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/JGVF9A/EATHF7)
It happened during lookups.

Here is the code

#include <iostream>
#include <fstream>
#include <algorithm>
#include <cstdint>
#include <utility>
#include <random>

#include "pgm/pgm_index_dynamic.hpp"

template<class T>
int load_data(T *&data, const std::string &file_path) {
    // open key file
    std::ifstream is(file_path.c_str(), std::ios::binary | std::ios::in);
    if (!is.is_open()) {
        return 0;
    }

    // read the number of keys
    T max_size;
    is.read(reinterpret_cast<char*>(&max_size), sizeof(T));

    // create array
    data = new T[max_size];

    // read keys
    is.read(reinterpret_cast<char *>(data), std::streamsize(max_size * sizeof(T)));
    is.close();
    return max_size;
}

int main() {
    uint64_t *keys;
    int size = load_data(keys, "fb_200M_uint64");
    std::sort(keys, keys + size);

    auto key_val = new std::pair<uint64_t, uint64_t>[size];
    for(int i = 0; i < size; i ++) {
        key_val[i].first = keys[i];
        key_val[i].second = 1;
    }
    auto index = new pgm::DynamicPGMIndex<uint64_t, uint64_t>(key_val, key_val + size);

    std::random_device rd;
    std::mt19937 g(rd());
    std::shuffle(keys, keys + size, g);
    
    for(int i = 0; i < size; i++) {
        auto iter = index->find(keys[i]);
        if (iter  == index->end()) {
            std::cout << keys[i] << " not found" << std::endl;   
        }
    }
    
}

One of the keys must have triggered this; I checked with gdb, and the key seems to be 18446744073709551615.

Request guidance on object-to-integer conversion.

This question was asked earlier in #17.

I do not want to beat a dead horse; I just want to put a better, clearer period to it. I've taken the advice in #17. Unless a reply here convinces me otherwise, I will use a radix trie per the author's recommendation and come back to PGM when it's a better fit.

But to make sure I'm not missing something, it's worth pointing out that whatever the underlying search structure to which we wish to adjoin PGM (or, even better, replace in whole or in part with PGM indexing), we are confronted with the problem of converting keys to integers. My keys are about 32 bytes on average, and there seems to be no trivial way to do this. Certainly it's not the PGM authors' problem to solve, but it might be helpful to identify practical or semi-practical avenues:

But please note that while compression offers many gains here and in other parts of the code, compression has to be order-preserving to be helpful for indexing.

  • In defense of PGM, the authors might point out that extended doubles, even ones far exceeding a typical 8-byte Intel double while still honoring vanilla 8-byte double properties, still offer benefits that far outweigh carrying around full keys for indexing.

There's also the idea of order-preserving minimal hashes: hash a key and take the resulting hash into PGM. The only thing I could find was https://github.com/DominikHorn/exotic-hashing, but I don't think those implementations go very far.

Finally, it might be worth pointing out that databases, through other work they already had to do, have integers in hand for keys, because they needed an associative map taking integer i to the i-th key. Perhaps i was a byte offset in a file or an offset from a base memory address, etc. In that case PGM can be adjoined to good benefit.

To close this issue, can we collectively point out some 3-5 good, practical avenues for order-preserving integer production from keys, for those who don't naturally fall into it otherwise?

big floats indexing issue

Notebook to reproduce issue with pygm

Unfortunately, I have not discovered the cause of the error, but I see this issue is only reproducible for floats that are big enough.

Still, a few keys cannot be found if the PGM-index is built in a single thread

Thank you for resolving the issue I reported previously. It seems like most of the issues I brought up last time have been resolved with the last commit. (#44)

So, cases where it can't find the key are occurring much less frequently. However, there are still a few cases where it seems unable to find the key.

Could you check the test code below? It is mostly the same as what I uploaded in the previous issue; I've modified the 4th step to search for all keys in the dataset.

Thanks!

  1. Download PGM-Index and eth dataset.
git clone https://github.com/gvinciguerra/PGM-index.git
cd PGM-index
wget --no-check-certificate -nc https://www.cse.cuhk.edu.hk/mlsys/gre/eth
  2. Modify the code to build the PGM-Index in a single thread.
// file: PGM-index/include/pgm/pgm_index.hpp

    template<typename RandomIt>
    static void build(RandomIt first, RandomIt last,
                      size_t epsilon, size_t epsilon_recursive,
                      std::vector<Segment> &segments,
                      std::vector<size_t> &levels_offsets) {
        ...
        auto build_level = [&](auto epsilon, auto in_fun, auto out_fun) {
            // auto n_segments = internal::make_segmentation_par(last_n, epsilon, in_fun, out_fun);
            auto n_segments = internal::make_segmentation(last_n, epsilon, in_fun, out_fun);
       ...
  3. Change the test code
#include "catch.hpp"
#include "pgm/morton_nd.hpp"
#include "pgm/pgm_index.hpp"
#include "pgm/pgm_index_dynamic.hpp"
#include "pgm/pgm_index_variants.hpp"
#include "pgm/piecewise_linear_model.hpp"
#include "utils.hpp"

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <cmath>
#include <functional>
#include <iterator>
#include <limits>
#include <map>
#include <random>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Loads values from binary file into vector.
static std::vector<uint64_t> load_data(const std::string& filename,
                                bool print = true) {
    std::vector<uint64_t> data;

    std::ifstream in(filename, std::ios::binary);
    if (!in.is_open()) {
        std::cerr << "unable to open " << filename << std::endl;
        exit(EXIT_FAILURE);
    }
    // Read size.
    uint64_t size;
    in.read(reinterpret_cast<char*>(&size), sizeof(uint64_t));
    data.resize(size);
    // Read values.
    in.read(reinterpret_cast<char*>(data.data()), size * sizeof(uint64_t));
    in.close();
    return data;
}

template <typename Index, typename Data>
void test_index(const Index &index, const Data &data) {
    auto rand = std::bind(std::uniform_int_distribution<size_t>(0, data.size() - 1), std::mt19937{42});

    for (auto i = 1; i <  data.size(); ++i) {
        auto q = data[rand()];
        auto range = index.search(q);
        auto lo = data.begin() + range.lo;
        auto hi = data.begin() + range.hi;
        auto k = std::lower_bound(lo, hi, q);
        REQUIRE(*k == q);
    }

    // Test elements outside range
    auto q = data.back() + 42;
    auto range = index.search(q);
    auto lo = data.begin() + range.lo;
    auto hi = data.begin() + range.hi;
    REQUIRE(std::lower_bound(lo, hi, q) == data.end());

    q = std::numeric_limits<typename Data::value_type>::min();
    range = index.search(q);
    lo = data.begin() + range.lo;
    hi = data.begin() + range.hi;
    REQUIRE(std::lower_bound(lo, hi, q) == data.begin());
}

TEMPLATE_TEST_CASE_SIG("PGM-index", "",
                       ((typename T, size_t E1, size_t E2), T, E1, E2),
                       (uint64_t, 512, 4)) {
    auto data = load_data("./eth");
    pgm::PGMIndex<T, E1, E2> index(data.begin(), data.end());
    test_index(index, data);
}
  4. Build & Run
cmake . -DCMAKE_BUILD_TYPE=Release
make -j8
./test/tests
  5. Result
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
tests is a Catch v2.13.10 host application.
Run with -? for options

-------------------------------------------------------------------------------
PGM-index - uint64_t, 512, 4
-------------------------------------------------------------------------------
/PGM-index/test/tests.cpp:70
...............................................................................

/PGM-index/test/tests.cpp:53: FAILED:
  REQUIRE( *k == q )
with expansion:
  119995231 (0x726fb5f)
  ==
  119995229 (0x726fb5d)

===============================================================================
test cases:       1 |       0 passed | 1 failed
assertions: 1618731 | 1618730 passed | 1 failed

implicit interval tree

I'm curious what would be required to support an interval tree in the PGM.

I see that the type K used in the index must be numeric. Perhaps this could be relaxed?

How is the compression of the intercepts calculated?

In the fourth chapter of the paper, it is mentioned that the intercepts need m log(n/m) + 1.92m + o(m) bits of storage. I have two questions: how was 1.92m calculated, and how was m log(n/m) obtained?
On the second question, the answer my search turned up was: "In the case of the PGM-index, n keys are divided into m segments, so each segment contains approximately n/m keys. This means that if we are looking for a particular key within a segment, we need to be able to represent n/m different values, which is why the intercept of each segment should be expressed in log(n/m) bits." If this answer is correct, does it mean that when you calculate the space usage, you estimate the worst case and reserve this space before the index is built? Or is it that, while computing the PLA model, the intercept is updated as the convex hull expands, there may be log(n/m) possibilities during the intercept update of each segment, and so each segment requires log(n/m) bits to represent its intercept? These are my guesses. I am a beginner; if you can take time out of your busy schedule to answer, I would be very grateful!

What does this adjustment do?

e346995 added code that appears to add 1 to the end of a run of 2 or more equal values, so long as that wouldn't result in the new value creating or extending a run. But the commit title is "small fixes" and there isn't any comment. Can you help me understand what this is for?

PGM index construction in infinite loop

Hi, me again... this should be the last one, since it's the last two datasets I have!

It seems like for synthetically generated 64-bit lognormal and normal distributions, the PGM construction can take a very long time (I tried running for over an hour before canceling). After 40 minutes, I sampled once a second with GDB, and the vast majority of the traces were in the std::minmax_element call at line 75 of piecewise_linear_model.hpp.

Dataset link: http://cs.brandeis.edu/~rcmarcus/lognormal_200M_uint64.xz

Driver program:

#include <vector>
#include <cstdint>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>
#include <algorithm>
#include <iterator>
#include "pgm_index.hpp"

static std::vector<uint64_t> load_data(const std::string& filename) {
  std::vector<uint64_t> data;

  std::ifstream in(filename, std::ios::binary);
  if (!in.is_open()) {
      std::cerr << "unable to open " << filename << std::endl;
      exit(EXIT_FAILURE);
  }
  
  // Read size.
  uint64_t size;
  in.read(reinterpret_cast<char*>(&size), sizeof(uint64_t));
  data.resize(size);
  
  // Read values.
  in.read(reinterpret_cast<char*>(data.data()), size*sizeof(uint64_t));
  in.close();

  return data;
}


int main(int argc, char **argv) {
    // Generate some random data
    std::vector<uint64_t> dataset = load_data("lognormal_200M_uint64");
    std::cout << "Dataset is loaded." << std::endl;

    // Construct the PGM-index
    const int error = 128;
    PGMIndex<uint64_t, error> index(dataset);

    // this line never prints:
    std::cout << "PGM index constructed." << std::endl;
    return 0;
}

BUG: missing nextafter for handling floating point input types

The following code does not handle floating point types correctly:

if (end == n)
    add_point(in(n - 1) + 1, n); // Ensure values greater than the last one are mapped to n

It needs a small fix according to this comment:

// For floating-point keys, the value x+1 above is replaced by the next representable value following x.

The following code handles floating point types correctly:

if constexpr (std::is_floating_point_v<K>) {
    K next;
    if ((next = std::nextafter(in(i), std::numeric_limits<K>::infinity())) < in(i + 1))
        add_point(next, i);
} else {
    if (in(i) + 1 < in(i + 1))
        add_point(in(i) + 1, i);
}

Multiple-level Dynamic PGM-index

@gvinciguerra I have done a test among unordered_map, ALEX, and pgm_index.
The result is very strange

It costs 932159
Hash table size is 175731056
It costs 8694683
alex size is 1575900636
Killed # pgm test for too large memory consumption

Is there any suggestion for designing Multiple-level Dynamic PGM-index? Many thanks in advance.
If you would like to have a try, please use

g++ -std=c++17 -fopenmp -I./PGM-index/include/ -I./ALEX/src/ bench.cpp -march=native -Wall -Wextra
#include <iostream>
#include <sys/random.h>
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <string>
#include <time.h>
#include <vector>
#include <iostream>
#include <algorithm>
#include <unordered_map>
#include <sys/time.h>
#include "pgm/pgm_index.hpp"
#include "pgm/pgm_index_dynamic.hpp"
#include "pgm/pgm_index_variants.hpp"
#include "core/alex.h"

struct KKK {
    uint64_t a;
    uint32_t b, c, d;
};

struct VVV{
	uint64_t a, b, c;
};

typedef pgm::DynamicPGMIndex<uint32_t, VVV> u32P;
typedef pgm::DynamicPGMIndex<uint64_t, u32P*> u64PGM2;
typedef pgm::DynamicPGMIndex<uint64_t, u64PGM2*> u64PGM1;

void pgm_test(std::vector<KKK> key, std::vector<VVV> value){
	u64PGM1 table;
	int number = key.size();
	struct timeval start, end;
	gettimeofday(&start, NULL);

	for (int i = 0; i < number; i++) {
		uint64_t k1 = key[i].a;
		uint64_t k2 = (((uint64_t)key[i].b) << 32) + (uint64_t)key[i].c; // parenthesized: + binds tighter than <<
		uint32_t k3 = key[i].d;
		auto iter = table.find(k1);
		if(iter == table.end()){
			table.insert_or_assign(k1, new u64PGM2());
			iter = table.find(k1);
		}
		auto p2 = iter->second;
		auto iter1 = p2->find(k2);
		if(iter1 == p2->end()){
			p2->insert_or_assign(k2, new u32P());
			iter1 = p2->find(k2);
		}
		iter1->second->insert_or_assign(k3, value[i]);
	}
	for (int i = 0; i < number; i++) {
		uint64_t k1 = key[i].a;
		uint64_t k2 = (((uint64_t)key[i].b) << 32) + (uint64_t)key[i].c;
		uint32_t k3 = key[i].d;
		auto iter = table.find(k1)->second->find(k2)->second->find(k3);
		VVV b = iter->second;
		if (b.a != value[i].a || b.b != value[i].b
				|| b.c != value[i].c) {
			printf("failed\n");
			return;
		}
	}
	gettimeofday(&end, NULL);
	int time_len = 1000000 * (end.tv_sec - start.tv_sec)
			+ (end.tv_usec - start.tv_usec);
	std::cout << "It costs " << time_len << std::endl;
	uint64_t size = table.size_in_bytes();

	for (auto &e : table) {
		size += e.second->size_in_bytes();
		for(auto &e1 : *(e.second)){
			size += e1.second->size_in_bytes();
		}
	}
	printf("pgm size is %llu\n", (unsigned long long)size);
}

typedef alex::Alex<uint32_t, VVV> u32A;
typedef alex::Alex<uint64_t, u32A*> u64P2;
typedef alex::Alex<uint64_t, u64P2*> u64P1;

void alex_test(std::vector<KKK> key,
		std::vector<VVV> value) {
	u64P1 table;
	int number = key.size();
	struct timeval start, end;
	gettimeofday(&start, NULL);
	for (int i = 0; i < number; i++) {
		uint64_t k1 = key[i].a;
		uint64_t k2 = (((uint64_t)key[i].b) << 32) + (uint64_t)key[i].c;
		uint32_t k3 = key[i].d;
		auto iter = table.find(k1);
		if (iter == table.end()) {
			table.insert(k1, new u64P2());
			iter = table.find(k1);
		}
		auto p2 = iter.payload();
		auto iter1 = p2->find(k2);
		if (iter1 == p2->end()) {
			p2->insert(k2, new u32A());
			iter1 = p2->find(k2);
		}
		iter1.payload()->insert(k3, value[i]);
	}
	for (int i = 0; i < number; i++) {
		uint64_t k1 = key[i].a;
		uint64_t k2 = (((uint64_t)key[i].b) << 32) + (uint64_t)key[i].c;
		uint32_t k3 = key[i].d;
		auto iter = table.find(k1).payload()->find(k2).payload()->find(k3);
		VVV b = iter.payload();
		if (b.a != value[i].a || b.b != value[i].b
				|| b.c != value[i].c) {
			printf("failed\n");
			return;
		}
	}
	gettimeofday(&end, NULL);
	int time_len = 1000000 * (end.tv_sec - start.tv_sec)
			+ (end.tv_usec - start.tv_usec);
	std::cout << "It costs " << time_len << std::endl;

	uint64_t size = table.data_size() + table.model_size();

	for (auto e = table.begin(); e != table.end(); e++) {
		auto t2 = e.payload();
		size += t2->data_size();
		size += t2->model_size();
		for(auto e1 = t2->begin(); e1 != t2->end(); e1++){
			size += e1.payload()->data_size();
			size += e1.payload()->model_size();
		}
	}
	printf("alex size is %llu\n", (unsigned long long)size);
}

struct TupleHasher {
    std::size_t
    operator()(const KKK &key) const {
        return key.a;
    }
};

struct TupleEqualer {
    bool operator()(const KKK &lhs, const KKK &rhs) const {
        return (lhs.a == rhs.a) && (lhs.b == rhs.b) && (lhs.c == rhs.c) && (lhs.d == rhs.d);
    }
};

void hashmap_test(std::vector<KKK> key,
		std::vector<VVV> value) {
	std::unordered_map<KKK, VVV, TupleHasher, TupleEqualer> table;
	int number = key.size();
	struct timeval start, end;
	gettimeofday(&start, NULL);
	for (int i = 0; i < number; i++) {
		table[key[i]] = value[i];
	}
	for (int i = 0; i < number; i++) {
		VVV b = table[key[i]];
		if (b.a != value[i].a || b.b != value[i].b
				|| b.c != value[i].c) {
			printf("failed\n");
			return;
		}
	}
	gettimeofday(&end, NULL);
	int time_len = 1000000 * (end.tv_sec - start.tv_sec)
			+ (end.tv_usec - start.tv_usec);
	std::cout << "It costs " << time_len << std::endl;
	uint64_t size = 0;

	for (unsigned i = 0; i < table.bucket_count(); ++i) {
		size_t bucket_size = table.bucket_size(i);
		if (bucket_size == 0) {
			size += sizeof(void*);
		} else {
			size += bucket_size * (sizeof(KKK) + sizeof(VVV));
			size += (bucket_size - 1) * sizeof(void*);
		}
	}

	printf("Hash table size is %llu\n", (unsigned long long)size);
}

int main(){
	std::vector<KKK> x_key;
	std::vector<VVV> x_value;
	for(int i=0; i<3199807; i++){
		KKK a;
		VVV b;
               // you can also generate them randomly
		a.a = i >> 96;
		a.b = i >> 64;
		a.c = i >> 32;
		a.d = i;
		b.a = 1 * i;
		b.b = 2 * i;
		b.c = 3 * i;
		x_key.push_back(a);
		x_value.push_back(b);
	}
	hashmap_test(x_key, x_value);
	alex_test(x_key, x_value);
	pgm_test(x_key, x_value);
}

Add support for Windows?

On Visual Studio 2019, the code does not build:

$ cl -I../include /O2 /permissive- /std:c++17 simple.cpp
Microsoft (R) C/C++ Optimizing Compiler Version 19.28.29337 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

simple.cpp
Compilation with -fopenmp is optional but recommended
C:\cygwin\tmp\PGM-index\include\pgm\piecewise_linear_model.hpp(37): error C4235: nonstandard extension used: '__int128' keyword not supported on this architecture
C:\cygwin\tmp\PGM-index\include\pgm\piecewise_linear_model.hpp(37): error C2059: syntax error: '>'
C:\cygwin\tmp\PGM-index\include\pgm\piecewise_linear_model.hpp(37): error C2062: type 'unknown-type' unexpected
C:\cygwin\tmp\PGM-index\include\pgm\piecewise_linear_model.hpp(40): error C2631: 'OptimalPiecewiseLinearModel': a class or enum cannot be defined in an alias template
../include\pgm/pgm_index.hpp(24): error C2143: syntax error: missing ';' before '{'
../include\pgm/pgm_index.hpp(24): error C2447: '{': missing function header (old-style formal list?)
simple.cpp(24): error C2039: 'PGMIndex': is not a member of 'pgm'
C:\cygwin\tmp\PGM-index\include\pgm\piecewise_linear_model.hpp(32): note: see declaration of 'pgm'
simple.cpp(24): error C2065: 'PGMIndex': undeclared identifier
simple.cpp(24): error C2062: type 'int' unexpected
simple.cpp(28): error C2065: 'index': undeclared identifier
simple.cpp(31): error C2672: 'lower_bound': no matching overloaded function found
simple.cpp(31): error C2893: Failed to specialize function template '_FwdIt std::lower_bound(_FwdIt,_FwdIt,const _Ty &)'
C:\Program Files (x86)\Microsoft Visual Studio\2019\Professional\VC\Tools\MSVC\14.28.29333\include\xutility(5871): note: see declaration of 'std::lower_bound'
simple.cpp(31): note: With the following template arguments:
simple.cpp(31): note: '_FwdIt=auto'
simple.cpp(31): note: '_Ty=auto'
simple.cpp(31): error C2780: '_FwdIt std::lower_bound(_FwdIt,const _FwdIt,const _Ty &,_Pr)': expects 4 arguments - 3 provided
C:\Program Files (x86)\Microsoft Visual Studio\2019\Professional\VC\Tools\MSVC\14.28.29333\include\xutility(5849): note: see declaration of 'std::lower_bound'
simple.cpp(35): fatal error C1004: unexpected end-of-file found
