
Introduction

ALEX is an ML-enhanced range index, similar in functionality to a B+ Tree. Our implementation is a near drop-in replacement for std::map or std::multimap. You can learn more about ALEX in our SIGMOD 2020 paper.

Table of Contents

Getting Started
Design Overview
API Documentation
Contributing

Getting Started

ALEX can be used as a header-only library. All relevant header files are found in src/core. You will need to compile your program with at least the C++14 standard (e.g., -std=c++14).
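Because ALEX is header-only, you can also compile a program directly against the headers without CMake. A minimal sketch (the file name `my_prog.cpp` is a placeholder; the flags mirror those used by the project's CMakeLists.txt):

```shell
# From the repository root: add src/core to the include path and enable
# the native instruction set (ALEX uses SIMD/BMI intrinsics).
g++ -std=c++14 -O3 -march=native -I src/core my_prog.cpp -o my_prog
```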

In this repository, we include three programs that you can compile and run:

On Windows, simply load this repository into Visual Studio as a CMake project. On Linux/Mac, use the following commands:

# Build using CMake, which creates a new build directory
./build.sh

# Build using CMake in Debug Mode, which creates a new build_debug directory and defines the macro 'DEBUG'
./build.sh debug

# Run example program
./build/example

# Run unit tests
./build/test_alex

To run the benchmark on a synthetic dataset with 1000 normally-distributed keys:

./build/benchmark \
--keys_file=resources/sample_keys.bin \
--keys_file_type=binary \
--init_num_keys=500 \
--total_num_keys=1000 \
--batch_size=1000 \
--insert_frac=0.5

However, to observe ALEX's true performance, you should run it on a much larger dataset. You can download a 200M-key (1.6GB) dataset from Google Drive. To run one example workload on this dataset:

./build/benchmark \
--keys_file=[download location] \
--keys_file_type=binary \
--init_num_keys=10000000 \
--total_num_keys=20000000 \
--batch_size=1000000 \
--insert_frac=0.5 \
--lookup_distribution=zipf \
--print_batch_stats

You can also run this benchmark on your own dataset. Your keys will need to be in either binary format or text format (one key per line). If the data type of your keys is not double, you will need to modify #define KEY_TYPE double to #define KEY_TYPE [your data type] in src/benchmark/main.cpp.
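For illustration, key files of this kind are commonly just the raw key values written contiguously with no header. A hedged sketch of producing such a file (this layout is an assumption on my part; verify it against the loading code in src/benchmark before relying on it):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <vector>

// Write keys as raw, contiguous fixed-width binary values, no header.
// (Assumption: this matches the benchmark's --keys_file_type=binary layout.)
bool write_binary_keys(const char* path, const std::vector<double>& keys) {
  FILE* f = std::fopen(path, "wb");
  if (!f) return false;
  std::size_t n = std::fwrite(keys.data(), sizeof(double), keys.size(), f);
  std::fclose(f);
  return n == keys.size();
}

// Read the file back into memory, one double at a time.
bool read_binary_keys(const char* path, std::vector<double>* keys) {
  FILE* f = std::fopen(path, "rb");
  if (!f) return false;
  double k;
  while (std::fread(&k, sizeof(double), 1, f) == 1) keys->push_back(k);
  std::fclose(f);
  return true;
}
```

If you change KEY_TYPE in the benchmark, the element type written here would need to change to match.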

Datasets

The four datasets we used in our SIGMOD 2020 paper are publicly available, all in binary format.

Design Overview

Like the B+ Tree, ALEX is a data structure that indexes sorted data and supports workloads that contain a mix of point lookups, short range queries, inserts, updates, and deletes. Internally, ALEX uses a collection of linear regressions, organized hierarchically into a tree, to model the distribution of keys. ALEX uses this model to efficiently search for data records by their key. ALEX also automatically adapts its internal models and tree structure to efficiently support writes.
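The core idea can be illustrated without any ALEX internals: fit a linear model from keys to positions in a sorted array, jump to the predicted position, then correct the model's error with a local search. This is only an illustrative sketch (all names here are hypothetical), not ALEX's actual implementation, which uses gapped arrays, a hierarchy of models, and exponential search; a simple linear scan stands in for the latter:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Fit a least-squares line mapping keys to their array positions.
struct LinearModel {
  double slope = 0.0, intercept = 0.0;

  void train(const std::vector<double>& keys) {
    const double n = static_cast<double>(keys.size());
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (std::size_t i = 0; i < keys.size(); ++i) {
      sx += keys[i]; sy += static_cast<double>(i);
      sxx += keys[i] * keys[i]; sxy += keys[i] * static_cast<double>(i);
    }
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    intercept = (sy - slope * sx) / n;
  }

  std::size_t predict(double key, std::size_t size) const {
    double pos = slope * key + intercept;
    pos = std::max(0.0, std::min(pos, static_cast<double>(size - 1)));
    return static_cast<std::size_t>(pos);
  }
};

// Model-based lookup: start at the predicted position, then scan
// locally to correct the prediction error.
std::size_t lookup(const std::vector<double>& keys, const LinearModel& m,
                   double key) {
  std::size_t pos = m.predict(key, keys.size());
  while (pos > 0 && keys[pos] > key) --pos;
  while (pos + 1 < keys.size() && keys[pos] < key) ++pos;
  return pos;
}
```

When the model fits the key distribution well, the local search touches only a few entries, which is the source of ALEX's speedup over comparison-based tree search.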

ALEX is inspired by the original learned index from Kraska et al. However, that work only supports reads (i.e., point lookups and range queries), while ALEX also efficiently supports writes (i.e., inserts, updates, and deletes).

In our paper, we show that ALEX outperforms alternatives in both speed and size:

  • On read-only workloads, ALEX beats the original learned index from Kraska et al. by up to 2.2X in performance with up to 15X smaller index size.
  • Across the spectrum of read-write workloads, ALEX beats B+ Trees (implemented by STX B+ Tree) by up to 4.1X while never performing worse, with up to 2000X smaller index size.

You can find many more details about ALEX here.

Limitations and Future Research Directions

ALEX currently operates in memory, is single-threaded, and supports only numerical keys. We are considering ways to add support for persistence, concurrency, and string keys to ALEX.

In terms of performance, ALEX has a couple of known limitations:

  • The premise of ALEX is to model the key distribution using a collection of linear regressions. Therefore, ALEX performs poorly when the key distribution is difficult to model with linear regressions, i.e., when the key distribution is highly nonlinear at small scales. A possible future research direction is to use a broader class of modeling techniques (e.g., also consider polynomial regression models).
  • ALEX can have poor performance in the presence of extreme outlier keys, which can cause the key domain and ALEX's tree depth to become unnecessarily large (see Section 5.1 of our paper). A possible future research direction is to add special logic for handling extreme outliers, or to have a modeling strategy that is robust to sparse key spaces.

API Documentation

We provide three user-facing implementations of ALEX:

  1. AlexMap is a near drop-in replacement for std::map.
  2. AlexMultiMap is a near drop-in replacement for std::multimap.
  3. Alex is the internal implementation that supports both AlexMap and AlexMultiMap. It exposes slightly more functionality.

ALEX has a few important differences compared to its standard library equivalents:

  • Keys and payloads (i.e., the mapped type) are stored separately, so dereferencing an iterator returns a copy of the key/payload pair, not a reference. Our iterators have methods to directly return references to the key or payload individually.
  • The iterators are of type ForwardIterator, instead of BidirectionalIterator. Therefore, iterators do not support decrementing.
  • Currently, we only support numerical key types. We do not support arbitrary user-defined key types. As a result, you should not change the default comparison function in the class template.

Detailed API documentation can be found in our wiki.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project follows the Google C++ Style Guide.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.


Issues

Cannot replicate experiments in Table 2

Dear Jialin,

I want to replicate the experiments of Table 2 in the paper.
[screenshot of Table 2 omitted]
In section 6.1.2, the paper stated that:

For a given dataset, we initialize an index with 100 million keys.

And I need help with the following 3 problems:

  • Q1: Does Avg depth denote the average depth of [all nodes(model nodes and data nodes)] or simply [data nodes]?
  • Q2: I assume Table 2 means bulk loading only 100 million keys for 4 datasets, right?
  • Q3: I tried to replicate the experiments in Table 2 but got different statistics. Could you help me check if there is something I missed? (Or maybe later commits have slightly changed some features?)

For Q3, I used the benchmark to test them and got different statistics. I set both init_num_keys and total_num_keys to 100M so that the benchmark only bulk loads 100M keys. For example,

./build/benchmark \
--keys_file=[path to location of a given dataset] \
--keys_file_type=binary \
--init_num_keys=100000000 \
--total_num_keys=100000000 \
--batch_size=100000 \
--insert_frac=0.5 \
--lookup_distribution=zipf \
--print_batch_stats

For each of longitudes, longlat, lognormal, and YCSB, I bulk-load 100M keys. The results are as follows:

[result screenshots omitted]

Thank you so much for your time.

Segmentation fault after insertions/lookups/deletions

I was testing this library and I ran into a segmentation fault. Here's a program that reproduces it:

$ cat alex-test.cc
#include <stdio.h>

#include "ALEX/src/core/alex_map.h"

int main(int argc, char **argv) {
  alex::AlexMap<uint64_t, uint64_t> map;

  FILE *f = fopen("alex-test-input.txt", "r");
  while (1) {
    char buf[100];
    char *line = fgets(buf, sizeof buf, f);
    if (!line) break;

    uint64_t key;
    if (line[0] == 'a') {
      sscanf(line, "a %lu", &key);
      printf("assign %lu\n", key);
      map[key] = 0;
    } else if (line[0] == 'd') {
      sscanf(line, "d %lu", &key);
      auto it = map.find(key);
      if (!it.is_end()) {
        printf("delete %lu\n", key);
        map.erase(it);
      } else {
        printf("!found %lu\n", key);
      }
    }
  }

  return 0;
}
$ clang++ -g alex-test.cc -o build/alex-test -march=haswell && ./build/alex-test > /dev/null
Segmentation fault

Here's a stack trace:

#0  0x000000000040a456 in alex::Alex<unsigned long, unsigned long, alex::AlexCompare, std::allocator<std::pair<unsigned long, unsigned long> >, false>::expand_root (
    this=0x7fffffffdb10, expand_left=true) at ./ALEX/src/core/alex.h:1454
#1  0x00000000004081f8 in alex::Alex<unsigned long, unsigned long, alex::AlexCompare, std::allocator<std::pair<unsigned long, unsigned long> >, false>::insert (this=0x7fffffffdb10,
    key=@0x7fffffffda80: 9178834004329713693, payload=@0x7fffffffd9d8: 0)
    at ./ALEX/src/core/alex.h:1117
#2  0x000000000040327d in alex::AlexMap<unsigned long, unsigned long, alex::AlexCompare, std::allocator<std::pair<unsigned long, unsigned long> > >::operator[] (
    this=0x7fffffffdb10, key=@0x7fffffffda80: 9178834004329713693)
    at ./ALEX/src/core/alex_map.h:128
#3  0x0000000000401640 in main (argc=1, argv=0x7fffffffdce8) at alex-test.cc:18

Here's the test input: https://slbkbs.org/tmp/alex-test-input.txt

Unable to erase previously inserted key

I've encountered some strange behavior where I cannot erase a key that was previously inserted into ALEX.

I've put together a self-contained program that reproduces the problem along with a trace. The program can be compiled by placing it under src in this repository (and adding an add_executable() to the CMakeLists.txt file). To reproduce, just run the compiled program with the provided trace.

The program and trace will bulk load ALEX with 9000 64-bit keys and then run a series of inserts and deletes. The delete of the last key in the trace (347892347588) appears to fail because erase() returns 0. However, the key should exist in ALEX because it was inserted earlier (line 16694 in the trace) and was not deleted afterwards.

I was able to reproduce the problem in both Debug and Release mode. I tried compiling on machines with g++ v9.3.0 and v11.1.0 and was able to reproduce this problem with both compiler versions.

Trace:
alex_remove_not_found.log

Program:

#include <cstdint>
#include <fstream>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

#include "core/alex.h"

namespace {

int LoadAndReplay(const std::string& path) {
  alex::Alex<uint64_t, uint64_t> index;
  std::vector<std::pair<uint64_t, uint64_t>> entries;

  std::ifstream trace(path);
  uint64_t lineno = 0;
  std::string type;
  uint64_t key;

  bool load_phase = true;

  while (trace >> type >> key) {
    lineno++;

    if (type == "L") {
      entries.emplace_back(key, lineno);
      continue;
    }

    // Done reading keys that should be bulk loaded.
    if (load_phase) {
      index.bulk_load(entries.data(), entries.size());

      // Validate the bulk load.
      size_t i = 0;
      for (const auto& rec : index) {
        if (rec.first != entries[i].first) {
          std::cerr << "After the bulk load, ALEX is missing the key ("
                    << entries[i].first << ")." << std::endl;
          return 3;
        }
        ++i;
      }

      load_phase = false;
    }

    if (type == "I") {
      const auto res = index.insert(key, lineno);
      if (!res.second) {
        std::cerr << "Failed to insert key (" << key << "). Line " << lineno
                  << std::endl;
        return 2;
      }
    }
    if (type == "R") {
      const auto num_erased = index.erase(key);
      if (num_erased == 0) {
        std::cerr << "Failed to erase key that should exist (" << key
                  << "). Line " << lineno << std::endl;
        return 1;
      }
    }
  }
  std::cerr << "Trace replayed successfully." << std::endl;
  return 0;
}

}  // namespace

int main(int argc, char* argv[]) {
  if (argc != 2) {
    std::cerr << "Usage: " << argv[0] << " path/to/trace.log" << std::endl;
    return 1;
  }
  return LoadAndReplay(argv[1]);
}

Expected output:

Trace replayed successfully.

Actual output:

Failed to erase key that should exist (347892347588). Line 36813

FLAG Setting: target specific option mismatch for _tzcnt_u64

When I compile ALEX with uint64_t as key, I encounter this error:

target specific option mismatch for _tzcnt_u64

I have added flags: -Wextra -Wundef -mavx -mpopcnt -mbmi.

This is an x86_64 Ubuntu machine with these CPU flags:

fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves flush_l1d arch_capabilities

Update: adding the flags -march=native -Wall -Wextra solves this issue. Closed.

Compiler Error: target specific option mismatch

When compiling on Linux, I get the following errors:

In file included from /usr/lib64/gcc/x86_64-suse-linux/7/include/immintrin.h:79:0,
                 from /usr/lib64/gcc/x86_64-suse-linux/7/include/x86intrin.h:48,
                 from /vol/fob-vol4/nebenf17/schraluk/Masterarbeit/ALEX/src/core/alex_base.h:28,
                 from /vol/fob-vol4/nebenf17/schraluk/Masterarbeit/ALEX/src/core/alex.h:38,
                 from /vol/fob-vol4/nebenf17/schraluk/Masterarbeit/ALEX/test/unittest_alex.h:6,
                 from /vol/fob-vol4/nebenf17/schraluk/Masterarbeit/ALEX/test/unittest_main.cpp:7:
/usr/lib64/gcc/x86_64-suse-linux/7/include/lzcntintrin.h:64:1: error: inlining failed in call to always_inline ‘long long unsigned int _lzcnt_u64(long long unsigned int)’: target specific option mismatch
 _lzcnt_u64 (unsigned long long __X)
 ^~~~~~~~~~
In file included from /vol/fob-vol4/nebenf17/schraluk/Masterarbeit/ALEX/src/core/alex_fanout_tree.h:13:0,
                 from /vol/fob-vol4/nebenf17/schraluk/Masterarbeit/ALEX/src/core/alex.h:39,
                 from /vol/fob-vol4/nebenf17/schraluk/Masterarbeit/ALEX/test/unittest_alex.h:6,
                 from /vol/fob-vol4/nebenf17/schraluk/Masterarbeit/ALEX/test/unittest_main.cpp:7:
/vol/fob-vol4/nebenf17/schraluk/Masterarbeit/ALEX/src/core/alex_nodes.h:2062:56: note: called from here
             bit_pos - (63 - static_cast<int>(_lzcnt_u64(bitmap_left_gaps)));
                                              ~~~~~~~~~~^~~~~~~~~~~~~~~~~~
In file included from /usr/lib64/gcc/x86_64-suse-linux/7/include/immintrin.h:79:0,
                 from /usr/lib64/gcc/x86_64-suse-linux/7/include/x86intrin.h:48,
                 from /vol/fob-vol4/nebenf17/schraluk/Masterarbeit/ALEX/src/core/alex_base.h:28,
                 from /vol/fob-vol4/nebenf17/schraluk/Masterarbeit/ALEX/src/core/alex.h:38,
                 from /vol/fob-vol4/nebenf17/schraluk/Masterarbeit/ALEX/test/unittest_alex.h:6,
                 from /vol/fob-vol4/nebenf17/schraluk/Masterarbeit/ALEX/test/unittest_main.cpp:7:
/usr/lib64/gcc/x86_64-suse-linux/7/include/lzcntintrin.h:64:1: error: inlining failed in call to always_inline ‘long long unsigned int _lzcnt_u64(long long unsigned int)’: target specific option mismatch
 _lzcnt_u64 (unsigned long long __X)

This is in fact the same as stated in that issue here: #13

The problem is that the flags named in that solution are already in the CMakeLists.txt file, so I can't add them to solve the issue.
Or am I doing something wrong?

cmake_minimum_required(VERSION 3.12)
project(alex)

set(CMAKE_CXX_STANDARD 14)

# Define the macro 'DEBUG' in debug mode
if(CMAKE_BUILD_TYPE STREQUAL Debug)        
    ADD_DEFINITIONS(-DDEBUG)               
endif()

if(MSVC)
    set(CMAKE_CXX_FLAGS "/O2 /arch:AVX2 /W1 /EHsc")
elseif (CMAKE_CXX_COMPILER_ID STREQUAL "Intel")
    set(CMAKE_CXX_FLAGS "-O3 -xHost")
else()
    # clang and gcc
    set(CMAKE_CXX_FLAGS "-O3 -march=native -Wall -Wextra")
endif()

include_directories(src/core)

add_executable(example src/examples/main.cpp)
add_executable(benchmark src/benchmark/main.cpp)

set(DOCTEST_DOWNLOAD_DIR ${CMAKE_CURRENT_BINARY_DIR}/doctest)
file(DOWNLOAD
    https://raw.githubusercontent.com/onqtam/doctest/2.4.6/doctest/doctest.h
    ${DOCTEST_DOWNLOAD_DIR}/doctest.h
    EXPECTED_HASH SHA1=40728d2bed7f074e80cb095844820114e7d16ce0
)

add_executable(test_alex test/unittest_main.cpp)
target_include_directories(test_alex PRIVATE ${DOCTEST_DOWNLOAD_DIR})

enable_testing()
add_test(test_alex test_alex)

Key is correct but payload is wrong

I modified the unit test of ALEX as follows:

TEST(Alex, TestInsertFromEmpty) {
  Alex<int, int> index;

  Alex<int, int>::V values[200];
  for (int i = 0; i < 200; i++) {
    values[i].first = rand() % 500;
    values[i].second = i;
  }

  for (int i = 0; i < 200; i++) {
    auto ret = index.insert(values[i].first, values[i].second);
    EXPECT_EQ(ret.first.key(), values[i].first);
  }

  // Check that getting the key is correct.
  for (int i = 0; i < 200; i++) {
    auto it = index.find(values[i].first);
    EXPECT_TRUE(!it.is_end());
    EXPECT_EQ(values[i].first, it.key());
    EXPECT_EQ(values[i].second, it.payload());
  }
}

I just added this line:

EXPECT_EQ(values[i].second, it.payload());

But it causes the test to fail.
[screenshot of the test failure omitted]

ALEX crashes in non-batch mode

I ran ALEX against all of the SOSD benchmark's uint64 datasets (download link). I tried to insert the first 100M records, but ALEX crashed for all datasets.

When I insert the first few thousand records (e.g., 100K records), the code does not crash, but ALEX crashes for a larger number of insertions across all datasets. It's most likely a case that went unnoticed. The error message is as follows:

alex_nodes.h:540: int alex::AlexDataNode<T, P, Compare, Alloc, allow_duplicates>::num_keys_in_range(int, int) const [with T = long unsigned int; P = int; Compare = alex::AlexCompare; Alloc = std::allocator<std::pair<long unsigned int, int> >; bool allow_duplicates = true]: Assertion `left >= 0 && left < right && right <= data_capacity_' failed. Aborted (core dumped)

[screenshot of the crash omitted]

I have plenty of memory (50GB+) so it is certainly not a memory issue.

The code is simple; I just insert the records one by one, up to MAX_INSERTS:

#include <iostream>
#include <vector>
#include <algorithm>
#include <time.h>

#include "alex.h"

#include "load.hpp"

using namespace std;

#define KEY_TYPE uint64_t
#define PAYLOAD_TYPE int

#define MAX_INSERTS 100000000

template <typename T>
void test_alex(vector<T> &data, string filename)
{
	alex::Alex<KEY_TYPE, PAYLOAD_TYPE> index;

	int cnt = 0;

	for (typename vector<T>::iterator it = data.begin(); it != data.end(); it++)
		if (cnt <= MAX_INSERTS)
			index.insert(*it, ++cnt);

	cout << cnt << endl;
}

int main(int argc, char **argv)
{
	string filename = argv[1];

	vector<uint64_t> data;
	load_bin<uint64_t>(filename, data);
	sort(data.begin(), data.end());
	data.erase(unique(data.begin(), data.end()), data.end());

	test_alex(data, filename);
}

The meaning of expected shifts approximation formula

// ((n-1)/2)((n-1)/2 + 1) = n^2/4 - 1/4

The paper said that in the assumption that "inserts are done according to the existing key distribution", the "average number of shifts for inserts is computed as the average distance to the closest gap in the Gapped Array for all existing keys".

I don't know how to derive the formula (n-1)/2 * ((n-1)/2 + 1) from the statement above. Can anybody explain it in detail? Thanks.
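For what it's worth, the algebra in the comment checks out, and one plausible reading (an assumption on my part, not something stated in the code or paper) is a layout where the keys are packed around a single reachable gap, so the distances from the keys to their closest gap are 1, 2, ..., m on each side with m = (n-1)/2:

```latex
% Total shifts summed over the keys, with m = (n-1)/2 distances per side:
2\sum_{d=1}^{m} d \;=\; m(m+1)
  \;=\; \frac{n-1}{2}\left(\frac{n-1}{2}+1\right)
  \;=\; \frac{n-1}{2}\cdot\frac{n+1}{2}
  \;=\; \frac{n^2-1}{4}
  \;=\; \frac{n^2}{4}-\frac{1}{4}.
```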

Keys out of order after insert/delete/lookup operations

This is the original problem I ran into before #1, but it exists even after applying #2 (in its current state). It only happens when applying lookups. Test program:

$ cat alex-test.cc
#include <stdio.h>

#include "ALEX/src/core/alex_map.h"

int main(int argc, char **argv) {
  alex::AlexMap<uint64_t, uint64_t> map;

  FILE *f = fopen("alex-test-input-lookups.txt", "r");
  while (1) {
    char buf[100];
    char *line = fgets(buf, sizeof buf, f);
    if (!line) break;

    uint64_t key;
    if (line[0] == 'a') {
      sscanf(line, "a %lu", &key);
      printf("assign %lu\n", key);
      map[key] = 0;
    } else if (line[0] == 'd') {
      sscanf(line, "d %lu", &key);
      auto it = map.find(key);
      if (!it.is_end()) {
        printf("delete %lu\n", key);
        map.erase(it);
      } else {
        printf("!found %lu\n", key);
      }
    } else if (line[0] == 'l') {
      sscanf(line, "l %lu", &key);
      auto it = map.find(key);
      printf("lookup %lu (%s)\n", key, it.is_end() ? "not found" : "found");
    }
  }

  uint64_t last_key = 0;
  for (auto it = map.begin(); !it.is_end(); ++it) {
    uint64_t key = (*it).first;
    if (key < last_key) {
      printf("out of order!\n");
      printf("last %020lu\n", last_key);
      printf("key  %020lu\n", key);
    }
    last_key = key;
  }

  return 0;
}
$ clang++ -g alex-test.cc -o build/alex-test -march=haswell && ./build/alex-test | tail -n3
out of order!
last 14654630671376060525
key  14623995723005132922

The test file is at http://slbkbs.org/tmp/alex-test-input-lookups.txt

Segmentation fault problem when lookup after bulk load

I wrote a simple test where the index first bulk loads 2,000,000 keys, then performs lookups one by one, but it hits a segmentation fault.

TEST(Alex, TestLookup) {
    Alex<int, int> index;
    Alex<int, int>::V values[2000000];

    for (int i = 0; i < 2000000; i++) {
        values[i].first = i;
        values[i].second = i;
    }

    std::sort(values, values + 2000000,
        [](auto const &a, auto const &b) { return a.first < b.first; });
    index.bulk_load(values, 2000000);
    EXPECT_EQ(2000000, index.size());

    for (int i = 0; i < 2000000; i++)
    {
        EXPECT_EQ(2000000, index.size());

        auto it = index.find(i);
        EXPECT_TRUE(!it.is_end()) << "The index of iterator is " << i;
        EXPECT_EQ(i, it.key());

        int *p = index.get_payload(i);
        EXPECT_TRUE(p);
        EXPECT_EQ(i, *p);

        EXPECT_EQ(2000000, index.size());
    }

    for (int i = 0; i < 2000000; i++)
    {
        auto it = index.find(i);
        EXPECT_TRUE(!it.is_end()) << "The index of iterator is " << i;
        EXPECT_EQ(i, it.key());

        int *p = index.get_payload(i);
        EXPECT_TRUE(p);
        EXPECT_EQ(i, *p);
    }
}

[screenshot of the segmentation fault omitted]

Insertion uses all memory

Hi all,

In my PySpark script, I call ALEX to insert around 1 million records. In some cases, ALEX uses all memory and the program crashes.
I am still trying to figure out the specific cause. I wonder if anyone has similar issues. Any input is appreciated.

Bests,

alex insert on Ubuntu

Hello, I ran into a problem when compiling and running the ALEX example (alex/src/examples/main.cpp) on Ubuntu. Here are the compile command and options:
g++ -march=native -mlzcnt -mbmi -std=c++14 -g main.cpp
When I run the code, I get this error:

a.out: ../core/alex_nodes.h:484: void alex::AlexDataNode<T, P, Compare, Alloc, allow_duplicates>::set_bit(int) [with T = double; P = double; Compare = alex::AlexCompare; Alloc = std::allocator<std::pair<double, double> >; bool allow_duplicates = true]: Assertion 'pos >= 0 && pos < data_capacity_' failed.
Aborted.

The same error happens when running the benchmark.

The OS is Ubuntu 14.04 LTS, and the CPU is Intel Xeon E5-2609.

I ran both the example and the benchmark successfully on Windows, so I am not sure what causes the error on Ubuntu.

Thanks in advance for your help with the insert operation; I greatly appreciate it.

Question about Figure 9: ALEX supports duplicate keys, but STX B+ Tree does not have this feature

Dear Jialin,

ALEX supports inserting <key, payload> pairs with the same key.
For example, if you insert the following 10 pairs with the same key 42:

  • <42, 1980013608>
  • <42, 678433864>
  • <42, 1161402928>
  • <42, 1467459259>
  • <42, 1692683787>
  • <42, 998901092>
  • <42, 475827625>
  • <42, 60631511>
  • <42, 281718131>
  • <42, 359319102>

ALEX will keep all 10 pairs in its structure.

However, STX B+ Tree [5] supports only one <key, payload> pair for a given key.
Therefore, if you insert the 10 pairs with key 42 from the example above, STX B+ Tree keeps only one of them: only the first insert succeeds, while the remaining 9 fail.

[screenshots omitted]

I thus wonder how the experiments in Figure 9 compare throughput between ALEX and the B+ Tree.
Did you change STX B+ Tree (v0.9) so that it supports inserting pairs with the same key?

Could I have the four datasets?

Hi, I am wondering if I could get the four datasets in the paper?

  • longitudes
  • longlat
  • lognormal
  • YCSB

The given sample dataset is only 200M, which is part of the longitudes dataset.
I have also tried to extract a dataset from OpenStreetMap, but I assume there must be a strategy you used to select the longitudes or latitudes. Could you say a little about this strategy? Thanks.

DataNode: `min_key_` not set.

Hi,

The variable max_key_ is set in several places in alex_nodes.h. On the other hand, min_key_ is updated only in insert (it is missing from both bulk_load variants). I just wanted to make sure that this is the intended behavior (and why).

Thanks!
Mihail.

cost questions

I read your code. The mysterious part is how you calculate the lookup, SMO, insert, and shift costs: you set the default to 20, but I am curious where that value comes from, and also why you made it an int.

expand_root: some new_children pointers are not assigned

Hi,

In the function expand_root (alex.h), two variables, in_bounds_new_nodes_start and in_bounds_new_nodes_end, control which part of root->children_ gets filled in, so some elements of root->children_ are never assigned a pointer. However, in the function expand (alex_nodes.h), all elements of a model node's children_ are traversed, so it can end up visiting null pointers.

Is this right? It seems that we should check whether an element of children_ is null when expanding a model node.
Thank you
