Giter Site home page Giter Site logo

roaringformatspec's Introduction

RoaringFormatSpec : specification of the compressed-bitmap Roaring formats

Standard 32-bit Roaring Bitmap

Roaring bitmaps are used by several important systems:

Many of these systems use the following interoperable format.

This specification assumes that you are familiar with Roaring bitmaps. Please refer to the following paper for details on the design rationale:

Let us recap that Roaring bitmaps are designed to store sets of 32-bit (unsigned) integers. Thus a Roaring bitmap can contain up to 4294967296 integers. They are made of three types of 16-bit containers: array, bitset and run containers. There are between 1 and 65536 (inclusively) containers. Each container has a cardinality (value in [1, 65536]), and it has a 16-bit most significant value (also called "key") in [0,65536). All containers are non-empty.

General layout

All words are written using little endian encoding.

The layout is designed so that random access to the data is possible without materializing the bitmap in memory while being storage efficient.

  • There is an initial "cookie header" which allows us to recognize that the bit stream is a roaring bitmap and gather some minimal information;
  • The cookie header is followed by a "descriptive header" which describes the containers;
  • Depending on the cookie header, there is an offset header to allow fast random access to the stored containers;
  • The header is followed by the containers, serialized one by one.

Throughout, we use the following constants :

  • SERIAL_COOKIE_NO_RUNCONTAINER = 12346,
  • SERIAL_COOKIE = 12347,
  • NO_OFFSET_THRESHOLD = 4

The below diagram represents a storage file containing both run containers and the offset header.

Format diagram

1. Cookie header

The cookie header spans either 64 bits or 32 bits followed by a variable number of bytes.

  1. If the first 32 bits take the value SERIAL_COOKIE_NO_RUNCONTAINER, then no container part of the Roaring bitmap can be of type "run" (only array and bitset containers are allowed). When the cookie has this particular value, the next 32 bits are used to store an integer representing the number of containers. If the bitmap is empty (i.e., it has no container), then you should choose this cookie header. In this scenario, the cookie header uses 64 bits.
  2. The 16 least significant bits of the 32-bit cookie have value SERIAL_COOKIE. In that case, the 16 most significant bits of the 32-bit cookie are used to store the number of containers minus 1. That is, if you shift right by 16 the cookie and add 1, you get the number of containers. Let size be the number of containers. Then we store (size + 7) / 8 bytes, following the initial 32 bits, as a bitset to indicate whether each of the containers is a run container (bit set to 1) or not (bit set to 0). The first (least significant) bit of the first byte corresponds to the first stored container and so forth. In this scenario, the cookie header uses 32 bits followed by (size + 7) / 8 bytes.

Thus it follows that the least significant 16 bits of the first 32 bits of a serialized bitmaps should either have the value SERIAL_COOKIE_NO_RUNCONTAINER or the value SERIAL_COOKIE. In other cases, we should abort the decoding.

After scanning the cookie header, we know how many containers are present in the bitmap.

2. Descriptive header

The cookie header is followed by a descriptive header. For each container, we store the key (16 most significant bits) along with the cardinality minus 1, using 16 bits for each value (for a total of 32 bits per container).

Thus, if there are x containers, the descriptive header will contain 32 x bits or 4 x bytes.

After scanning the descriptive header, we know the type of each container. Indeed, if the cookie took value SERIAL_COOKIE, then we had a bitset telling us which containers are run containers; otherwise, we know that there are no run containers. For the containers that are not run containers, then we use the cardinality to determine the type: a cardinality of up and including 4096 indicates an array container whereas a cardinality above 4096 indicates a bitset container.

3. Offset header

If and only if one of these is true

  1. the cookie takes value SERIAL_COOKIE_NO_RUNCONTAINER
  2. the cookie takes the value SERIAL_COOKIE and there are at least NO_OFFSET_THRESHOLD containers,

then we store (using a 32-bit value) the location (in bytes) of the container from the beginning of the stream (starting with the cookie) for each container.

4. Container storage

The containers are then stored one after the other.

  • For array containers, we store a sorted list of 16-bit unsigned integer values corresponding to the array container. So if there are x values in the array container, 2 x bytes are used.
  • Bitset containers are stored using exactly 8KB using a bitset serialized with 64-bit words. Thus, for example, if value j is present, then word j/64 (starting at word 0) will have its (j%64) least significant bit set to 1 (starting at bit 0).
  • A run container is serialized as a 16-bit integer indicating the number of runs, followed by a pair of 16-bit values for each run. Runs are non-overlapping and sorted. Thus a run container with x runs will use 2 + 4 x bytes. Each pair of 16-bit values contains the starting index of the run followed by the length of the run minus 1. That is, we interleave values and lengths, so that if you have the values 11,12,13,14,15, you store that as 11,4 where 4 means that beyond 11 itself, there are 4 contiguous values that follow. Other example: e.g., 1,10, 20,0, 31,2 would be a concise representation of 1, 2, ..., 11, 20, 31, 32, 33

Testing

At a minimum, all Roaring implementations should be able to parse the two files in the testdata directory, see the associated README.md file found in the directory.

Reference implementations

Sample Java source code

    int startOffset = 0;
    boolean hasrun = hasRunContainer();
    if (hasrun) {
      out.writeInt(Integer.reverseBytes(SERIAL_COOKIE | ((size - 1) << 16)));
      byte[] bitmapOfRunContainers = new byte[(size + 7) / 8];
      for (int i = 0; i < size; ++i) {
        if (this.values[i] instanceof RunContainer) {
          bitmapOfRunContainers[i / 8] |= (1 << (i % 8));
        }
      }
      out.write(bitmapOfRunContainers);
      if (this.size < NO_OFFSET_THRESHOLD) {
        startOffset = 4 + 4 * this.size + bitmapOfRunContainers.length;
      } else {
        startOffset = 4 + 8 * this.size + bitmapOfRunContainers.length;
      }
    } else { // backwards compatibility
      out.writeInt(Integer.reverseBytes(SERIAL_COOKIE_NO_RUNCONTAINER));
      out.writeInt(Integer.reverseBytes(size));
      startOffset = 4 + 4 + 4 * this.size + 4 * this.size;
    }
    for (int k = 0; k < size; ++k) {
      out.writeShort(Short.reverseBytes(this.keys[k]));
      out.writeShort(Short.reverseBytes((short) (this.values[k].getCardinality() - 1)));
    }
    if ((!hasrun) || (this.size >= NO_OFFSET_THRESHOLD)) {
      // writing the containers offsets
      for (int k = 0; k < this.size; k++) {
        out.writeInt(Integer.reverseBytes(startOffset));
        startOffset = startOffset + this.values[k].getArraySizeInBytes();
      }
    }
    for (int k = 0; k < size; ++k) {
      values[k].writeArray(out);
    }

Sample C source code

    char *initbuf = buf;
    uint32_t startOffset = 0;
    bool hasrun = ra_has_run_container(ra);
    if (hasrun) {
        uint32_t cookie = SERIAL_COOKIE | ((ra->size - 1) << 16);
        memcpy(buf, &cookie, sizeof(cookie));
        buf += sizeof(cookie);
        uint32_t s = (ra->size + 7) / 8;
        uint8_t *bitmapOfRunContainers = (uint8_t *)calloc(s, 1);
        assert(bitmapOfRunContainers != NULL);  // todo: handle
        for (int32_t i = 0; i < ra->size; ++i) {
            if (get_container_type(ra->containers[i], ra->typecodes[i]) ==
                RUN_CONTAINER_TYPE_CODE) {
                bitmapOfRunContainers[i / 8] |= (1 << (i % 8));
            }
        }
        memcpy(buf, bitmapOfRunContainers, s);
        buf += s;
        free(bitmapOfRunContainers);
        if (ra->size < NO_OFFSET_THRESHOLD) {
            startOffset = 4 + 4 * ra->size + s;
        } else {
            startOffset = 4 + 8 * ra->size + s;
        }
    } else {  // backwards compatibility
        uint32_t cookie = SERIAL_COOKIE_NO_RUNCONTAINER;

        memcpy(buf, &cookie, sizeof(cookie));
        buf += sizeof(cookie);
        memcpy(buf, &ra->size, sizeof(ra->size));
        buf += sizeof(ra->size);

        startOffset = 4 + 4 + 4 * ra->size + 4 * ra->size;
    }
    for (int32_t k = 0; k < ra->size; ++k) {
        memcpy(buf, &ra->keys[k], sizeof(ra->keys[k]));
        buf += sizeof(ra->keys[k]);

        uint16_t card =
            container_get_cardinality(ra->containers[k], ra->typecodes[k]) - 1;
        memcpy(buf, &card, sizeof(card));
        buf += sizeof(card);
    }
    if ((!hasrun) || (ra->size >= NO_OFFSET_THRESHOLD)) {
        // writing the containers offsets
        for (int32_t k = 0; k < ra->size; k++) {
            memcpy(buf, &startOffset, sizeof(startOffset));
            buf += sizeof(startOffset);
            startOffset =
                startOffset +
                container_size_in_bytes(ra->containers[k], ra->typecodes[k]);
        }
    }
    for (int32_t k = 0; k < ra->size; ++k) {
        buf += container_write(ra->containers[k], ra->typecodes[k], buf);
    }

Unsigned integers in Java

Java lacks native unsigned integers, but integers are still considered to be unsigned within Roaring, and ordered according to  Integer.compareUnsigned.

For 32-bit integers in [0,2147483647], the unsigned and signed integers are undistinguisable since Java relies a 32-bit two's complement format for its int type.

Extension for 64-bit implementations

Some Roaring bitmap implementations may offer a 64-bit implementation. This section proposes a portable format, compatible with some (but not all) 64-bit implementations. This format is naturally compatible with implementations based on a conventional red-black-tree (as the serialization format is similar to the in-memory layout). The keys would be 32-bit integers representing the most significant 32~bits of elements whereas the values of the tree are 32-bit Roaring bitmaps. The 32-bit Roaring bitmaps represent the least significant bits of a set of elements.

General layout

All words are written using little endian encoding.

  • Write as long/uint64 the distinct number of buckets (restricted to integers in [0,4294967295]) (a.k.a the number of distinct keys being the most significant 32~bits of elements, leading to a four zero bytes padding).
  • Iterate through buckets ordered by increasing keys (as unsigned integers), for each:
  • first writing as int/uint32 the most significant 32~bits of the bucket (in [0,2^32-1])
  • second writing the 32-bit Roaring bitmaps representing the least significant bits of a set of elements

Unsigned longs in Java

Java lacks native unsigned longs, but longs are still considered to be unsigned within Roaring, and ordered according to Long.compareUnsigned.

For 64-bit integers in [0,18446744073709551616], the unsigned and signed longs are undistinguisable since Java relies a 64-bit two's complement format for its long type.

Alternative 64-bit implementations

Java Roaring bitmaps implementation offers an ART-based 64-bit implementation. It may reach better performances (compression and/or computation). But as of 2022-11, it is not compatible with this Serialization format.

Java Roaring bitmaps implementation offers an Map-based 64-bit implementation handling signed longs. It is not compatible with this serialization format (which does not handle signed keys).

roaringformatspec's People

Contributors

aaabramenko avatar alanbernstein avatar blacelle avatar dr-emann avatar lemire avatar mischief avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

roaringformatspec's Issues

Lack of clarity around what type a container with cardinality exactly 4096

then we use the cardinality to determine the type: a cardinality of up to 4096 indicates an array container whereas a cardinality above 4096 indicates a bitset container.

I think changing "up to" to "up to and including" would clarify what should happen with a cardinality of exactly 4096.

https://oneminuteenglish.org/en/up-to-inclusive-or-exclusive/ says

“Up to” can have an exclusive meaning when dealing with subjects that are more precise like maths and software development.
The best way to avoid confusion is to use “up to and including X” or “up to and not including X ”, where X is the upper limit.

About memory alignment of container values

Hey @lemire,

I am currently in the process of greatly improving the deserialization algorithm of the roaring-rs library. The binary serialization format is quite good. It allows the Bitmap types to borrow a reference to the memory from which it deserializes.

However, there is a little problem, a memory alignment problem. In Rust (and other languages), it is undefined behavior to read a type that isn't well aligned in memory, e.g., reading a bitmap, aligned on 64bits, at address 33 (33 % 8 is 1, not 0, therefore unaligned).

I have two solutions to this problem in mind:

  • The easy one could be to backup on copying the bytes of a container into an aligned buffer (a Vec) if those are not valid. This is a little bit frustrating as it could happen a lot.
  • The alternative is more a question: Do you think that I could use the header offsets when serializing to force the containers to be correctly aligned, introducing some garbage bytes between containers? Or do I need to keep no space between contiguous containers?

Thank you!

[question] any suggestion about compressing serialized data?

suppose i have multiple clients implemented in different languages(go, java...), and all of them need to download roaringbitmap from server through http.

when bitmap gets large(about 10 millions) and serialized binary data comes out about 20 MB, i think compress before sending may save a lot of transmission time.

i am trying use gzip. any suggestions about compressing?

thanks in advance

“The cookie header spans either 32 bits or 64 bits.” What does this sentence mean?

In Cookie header chapter, the first sentence is "The cookie header spans either 32 bits or 64 bits".

but in fact, cookie header in roaringBitmap use 8 bytes( 32bits) when has no runContainer, or either use ( 4 + (size + 7) / 8) bytes( uncertain bits) when has runContainer. so what's the meaning of the either 32 bits or 64 bits ? i'm a little confused,hope to get some explains.

may be, i got it. 32bits means the SERIAL_COOKIE | ((size - 1) << 16, it does not include runFlagBitset, it's not describes accurately in my view

Three questions about RoaringFormatSpec.

I hava three questions about the RoaringFormatSpec:

  • Question1: What does "The cookie header spans either 32 bits or 64 bits." mean?

    If the cookie takes the value SERIAL_COOKIE_NO_RUNCONTAINER, it will consume 64bits totally, and I understand it as the cookie header spans 64bits.

    But if the cookie takes the value SERIAL_COOKIE, as shown in the picture below, it will take 32bit to store SERIAL_COOKIE and Container count, but another (size + 7) / 8 bytes to store RUN infomation.
    image

    Then, what does "The cookie header spans either 32 bits or 64 bits." mean?

  • Question2: Why runFlagBitset take (size + 7) / 8 bytes, not size/8?

  • Question3: When will we use NO_OFFSET_THRESHOLD in Cookie Header?

Waiting for your reply, Thank you.

code doesn't match spec

The 16 least significant bits of the 32-bit cookie have value SERIAL_COOKIE. In that case, the 16 most significant bits of the 32-it cookie are used to store the number of containers minus 1. That is, if you shift right by 16 the cookie and add 1, you get the number of containers. Let size be the number of containers. Then we store (size + 7) / 8 bytes, following the initial 64 bits,

The sample Java code starts the isRun bitmap immediately after the 32-bit cookie (so after the first 32-bits), whereas the spec, in the paragraph quoted above, says it should come after the first 64-bits.

    int startOffset = 0;
    boolean hasrun = hasRunContainer();
    if (hasrun) {
      out.writeInt(Integer.reverseBytes(SERIAL_COOKIE | ((size - 1) << 16)));
      byte[] bitmapOfRunContainers = new byte[(size + 7) / 8];
      for (int i = 0; i < size; ++i) {
        if (this.values[i] instanceof RunContainer) {
          bitmapOfRunContainers[i / 8] |= (1 << (i % 8));
        }
      }
      out.write(bitmapOfRunContainers);

Endianness conversion

Hello,
RoaringBitmap is really nice. I think that the one thing that it might be missing is dealing with endianness esp when it comes to serialization. I think that as a part of the cookie headers, there should be a bit indicating whether the bitmap is little or big endian and then there should be method to check the endianness and also to convert it to and from little/big endian. I think that this functionality makes more sense in the library than outside of it.

Let me know what you think.

Wrong cookie header description in diagram.png.

In diagram.png we see 4 bytes for (Cookie) and 4 bytes for (Nc = Container count).
But in according to RoaringFormatSpec in case of bitmap has run containers description should be:

  • 2 bytes for (Cookie = SERIAL_COOKIE);
  • 2 bytes for (Nc = Container count - 1).
    This discrepancy makes confused.

code/data sample do not conform to the offset map sepc

  1. Offset header
    If and only if the cookie took value SERIAL_COOKIE and there are at least NO_OFFSET_THRESHOLD, then we store for each container (using a 32-bit value) to location (in bytes) of the container from the beginning of the stream (starting with the cookie).

the github.com/RoaringBitmap/roaring/testdata/bitmapwithoutruns.bin appears to violate the "if-and-only-if" part of the spec. serialCookie is not present, and yet the offset map is still present.

!serialCookie, so after skip of isRun bitmap, and read of size, now at pos = 8
....
skipping/should be no offset header, because !serialCookie  pos is now 52

at key 0, filepos 52, reading an array container with card 66

at key 0, filepos 52, read an array container (card 66) = {96, 0, 228, 0, 296, 0, 8488, 0, 16680, 0, 24872, 0, 33064, 0, 41256, 0, 48040, 0, 56232, 0, 64424, 0, 0, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 11000, 12000, 13000, 14000, 15000, 16000, 17000, 18000, 19000, 20000, 21000, 22000, 23000, 24000, 25000, 26000, 27000, 28000, 29000, 30000, 31000, 32000, 33000, 34000, 35000, 36000, 37000, 38000, 39000, 40000, 41000, 42000, 4\
3000, }

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.