Giter Site home page Giter Site logo

Comments (14)

abreslow avatar abreslow commented on July 19, 2024 1

@jlgreathouse please see my pull request and its associated comments. Thanks. :-)

from morton_filter.

abreslow avatar abreslow commented on July 19, 2024

Hi Thomas,

Thank-you for being an early user of the code base. I deeply appreciate your time and effort.

Could you send me the full test code so that I can reproduce the error? What is the relative incidence of false positives (e.g., 1/1000000)? Does the problem only crop up after a specific load factor? Which version of the code are you using? How large are the filters that you are testing with?

I've been meaning to write a detailed series of unit tests. This might be a good opportunity to do that.

How urgently do you need this resolved? When would you like a response/answer by? I have an upcoming paper deadline on August 23 so if I could push this out a couple of days, that would be helpful to me.

I'm also curious why you are using the item-at-a-time interfaces. Does your use case preclude batching insertions and lookups? The batched interfaces provide much higher throughput for large filters and should be used when possible.

Thanks,
Alex

from morton_filter.

thomasmueller avatar thomasmueller commented on July 19, 2024

Here is a better test case, it is just changing benchmark.cc

int main(int argc, char** argv){
  size_t size = 10000000;
  CompressedCuckoo::Morton3_8* filter = new CompressedCuckoo::Morton3_8((size_t) (size / 0.50) + 64);
  for(size_t i = 0; i < size; i++) {
    uint64_t x = i;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9L;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebL;
    x = x ^ (x >> 31);
    if (i % 100000 == 0) {
      printf("insert i=%lld key=%lld\n", (uint64_t) i, x);
    }
    filter->insert(x);
  }
  for(size_t i = 0; i < size; i++) {
    uint64_t x = i;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9L;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebL;
    x = x ^ (x >> 31);
    if (i % 100000 == 0) {
      printf("lookup i=%lld key=%lld\n", (uint64_t) i, x);
    }
    if (!filter->likely_contains(x)) {
      printf("not found: %lld\n", x);
      return -1;
    }
  }
  printf("OK\n");
  // benchmark();
}

./benchmark

insert i=0 key=0
insert i=100000 key=-8804475191098573882
insert i=200000 key=-6885798598542860951
insert i=300000 key=5826581874738629586
insert i=400000 key=4675146885213764307
...
lookup i=0 key=0
lookup i=100000 key=-8804475191098573882
lookup i=200000 key=-6885798598542860951
lookup i=300000 key=5826581874738629586
lookup i=400000 key=4675146885213764307
...
not found: 1163028906654455684

I found that I get "bus error" if the load factor is 0.6 or higher, and with load factor of 0.5 I get often one entry not found (one false negative). False positives are as expected (I think), but I expect no false negatives as per the API contract.

How urgently do you need this resolved?

It's a good question... it would be good to have a workaround within one week. Maybe the batch insert doesn't have this problem?

I'm also curious why you are using the item-at-a-time interfaces.

For inserts, I can often use batching. For queries, my use case doesn't easily allow batching. It would require a large rewrite to use batching in some cases, and it's not possible to use it in all cases.

from morton_filter.

thomasmueller avatar thomasmueller commented on July 19, 2024

Unfortunately, bulk insert (insert_many) seems to have the same problem. Here a test case:

int main(int argc, char** argv){
  size_t size = 10000000;
  ::std::vector<::std::uint64_t> keys(size);
  uint64_t index = 0;
  auto genrand = [&index]() {
    uint64_t x = index++;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9L;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebL;
    x = x ^ (x >> 31);
    return x;
  };
  ::std::generate(keys.begin(), keys.end(), ::std::ref(genrand));
  ::std::vector<bool> status(size);
  CompressedCuckoo::Morton3_8* filter = new CompressedCuckoo::Morton3_8((size_t) (size / 0.50) + 64);
  bool result = filter->insert_many(keys, status, size);
  printf("result = %d\n", result);
  for(size_t i = 0; i < size; i++) {
    if (!status[i]) {
      printf("status[%ld] = false\n", i);
    }
    if (i == 2273501 || keys[i] == 1163028906654455684LL) {
      printf("keys[%ld] = %lld\n", i, keys[i]);
    }
  }
  for(size_t i = 0; i < size; i++) {
    uint64_t x = i;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9L;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebL;
    x = x ^ (x >> 31);
    if (i % 100000 == 0) {
      printf("lookup i=%lld key=%lld\n", (uint64_t) i, x);
    }
    if (!filter->likely_contains(x)) {
      printf("not found: i=%ld key=%lld\n", i, x);
      return -1;
    }
  }
  printf("OK\n");
  // benchmark();
}

Output for me:

./benchmark
result = 1
keys[2273501] = 1163028906654455684
lookup i=0 key=0
lookup i=100000 key=-8804475191098573882
...
lookup i=2200000 key=-8886166559463644051
not found: i=2273501 key=1163028906654455684    

from morton_filter.

abreslow avatar abreslow commented on July 19, 2024

Hi Thomas,
Thanks for these error reproducers. I will try to take a look at this by Wednesday to see if I can reproduce the error and hopefully have a fix by Friday.
-Alex

from morton_filter.

abreslow avatar abreslow commented on July 19, 2024

Thomas,

I'm having trouble reproducing the error on my Threadripper-based system with a clone of the master branch (latest commit 235531f). The cause of this error on your system could be a number of things. If you are using clang++, it could be that the code base is using undefined behavior that only manifests with that compiler. It could also be that you are using an old version of the code that defines Morton::Morton3_8 differently than present. I will test if the code breaks if the Block Fullness Array is enabled (in the VLDB Journal extension) or if resizing is enabled (also described in the VLDB Journal extension).

I get the following output (... for ellipsis) when using the vanilla code from the master branch:
...
lookup i=9700000 key=-304284872283253451
lookup i=9800000 key=-39302948963652970
lookup i=9900000 key=2387930219243299873
OK

Here's my OS information: Linux Jurassic 4.13.0-45-generic #50~16.04.1-Ubuntu
My compiler information:

g++ (Ubuntu 5.5.0-12ubuntu1~16.04) 5.5.0 20171010

I add the following line to the Makefile:

$(CXX) $(INCLUDE) $(FLAGS) test.cc -o test

Here's the source for test.cc based on what you provided (placed in the morton_filter/benchmarking directory). Note that I upped 0.5 to 0.97 and it still runs fine.:

#include "morton_filter.h"
#include "./cuckoofilter/benchmarks/random.h"

int main(int argc, char** argv){
  size_t size = 10000000;
  ::std::vector<::std::uint64_t> keys(size);
  uint64_t index = 0;
  auto genrand = [&index]() {
    uint64_t x = index++;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9L;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebL;
    x = x ^ (x >> 31);
    return x;
  };
  ::std::generate(keys.begin(), keys.end(), ::std::ref(genrand));
  ::std::vector<bool> status(size);
  CompressedCuckoo::Morton3_8* filter = new CompressedCuckoo::Morton3_8((size_t) (size / 0.97) + 64);
  bool result = filter->insert_many(keys, status, size);
  printf("result = %d\n", result);
  for(size_t i = 0; i < size; i++) {
    if (!status[i]) {
      printf("status[%ld] = false\n", i);
    }
    if (i == 2273501 || keys[i] == 1163028906654455684LL) {
      printf("keys[%ld] = %lld\n", i, keys[i]);
    }
  }
  for(size_t i = 0; i < size; i++) {
    uint64_t x = i;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9L;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebL;
    x = x ^ (x >> 31);
    if (i % 100000 == 0) {
      printf("lookup i=%lld key=%lld\n", (uint64_t) i, x);
    }
    if (!filter->likely_contains(x)) {
      printf("not found: i=%ld key=%lld\n", i, x);
      return -1;
    }
  }
  printf("OK\n");
  // benchmark();
}

from morton_filter.

abreslow avatar abreslow commented on July 19, 2024

I checked and enabling the Block Full Array and enabling resizing using the test code that you provided do not cause the error to manifest when the code is compiled with g++.

Note that I am not saying that the bug does not exist: it may very well exist. I am just having a hard time reproducing it as this code shows (expected value is 100):

{ for i in `seq 1 100`; do ./test; done } | awk '/OK/{count+=1}; END{print count}'

The output

100

from morton_filter.

thomasmueller avatar thomasmueller commented on July 19, 2024

That's great news! So far I used LLVM (Mac OS). With g++ it works, well most of the time at least, with a load factor of 0.95. Sometimes I get this, when using a set size of 1 million: "corrupted size vs. prev_size". See also https://stackoverflow.com/questions/49628615/understanding-corrupted-size-vs-prev-size-glibc-error/49660707

I didn't try with older version of g++ / GCC.

So I will update the issue title to say LLVM.

from morton_filter.

thomasmueller avatar thomasmueller commented on July 19, 2024

It is no longer blocking me. You may want to keep this issue open if you are interested in solving the problem with clang / LLVM, or maybe improve documentation. It is perfectly fine for me if you close it.

from morton_filter.

abreslow avatar abreslow commented on July 19, 2024

I will keep it open. If I have more time and get access to a Mac OS system, I will try to port the main branch to support compilation and correct outputs with clang++ on Mac OS. If I recall, I use some functions from C's math library that support the constexpr qualifier when using g++ but not when using clang++. I don't know if newer versions of clang++ correct for this. When I ported to running a Clang, I saw a dip in performance. The g++-5.4 compiler gave better results, so I stuck with that. With regard to Clang, I don't recall if there were any bugs that I observed; this was a couple of months ago. However, given the ubiquity of Clang and Mac OS, I would like to support these platforms.

from morton_filter.

asl avatar asl commented on July 19, 2024

Indeed, the code contains UB.

Running both clang and gcc with -O1 -fsanitize=undefined yields:

compressed_cuckoo_filter.h:733:35: runtime error: shift exponent 128 is too large for 128-bit type '__uint128_t' (aka 'unsigned __int128')

This is for the following line of code:

const __uint128_t mask = (one << (_fullness_counter_width * counter_index)) - one;

Changing it to:

    const unsigned shift = _fullness_counter_width * counter_index;
    const __uint128_t mask = shift == 128 ? __uint128_t(-1) : (one << shift) - one;

fixes the clang issue for me.

from morton_filter.

abreslow avatar abreslow commented on July 19, 2024

@asl I would love to push your change, but I no longer have write permission to this repo. I will see if I can get one of my former colleagues to make the update. Thanks for your contribution.

from morton_filter.

jlgreathouse avatar jlgreathouse commented on July 19, 2024

For reference, you could submit a PR about this and ping me, I can accept the change.

from morton_filter.

abreslow avatar abreslow commented on July 19, 2024

from morton_filter.

Related Issues (3)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.