Giter Site home page Giter Site logo

roaringbitmap / roaring Goto Github PK

View Code? Open in Web Editor NEW
2.4K 39.0 217.0 128.24 MB

Roaring bitmaps in Go (golang), used by InfluxDB, Bleve, DataDog

Home Page: http://roaringbitmap.org/

License: Apache License 2.0

Go 98.96% Assembly 0.67% Makefile 0.38%
go roaring-bitmaps bitmap-compression bitset indexing databases

roaring's Introduction

roaring

GoDoc Go Report Card

Go-CI Go-ARM-CI Go-Windows-CI

This is a go version of the Roaring bitmap data structure.

Roaring bitmaps are used by several major systems such as Apache Lucene and derivative systems such as Solr and Elasticsearch, Apache Druid (Incubating), LinkedIn Pinot, Netflix Atlas, Apache Spark, OpenSearchServer, anacrolix/torrent, Whoosh, Redpanda, Pilosa, Microsoft Visual Studio Team Services (VSTS), and eBay's Apache Kylin. The YouTube SQL Engine, Google Procella, uses Roaring bitmaps for indexing.

Roaring bitmaps are found to work well in many important applications:

Use Roaring for bitmap compression whenever possible. Do not use other bitmap compression methods (Wang et al., SIGMOD 2017)

The roaring Go library is used by

This library is used in production in several systems, it is part of the Awesome Go collection.

There are also Java and C/C++ versions. The Java, C, C++ and Go version are binary compatible: e.g, you can save bitmaps from a Java program and load them back in Go, and vice versa. We have a format specification.

This code is licensed under Apache License, Version 2.0 (ASL2.0).

Copyright 2016-... by the authors.

When should you use a bitmap?

Sets are a fundamental abstraction in software. They can be implemented in various ways, as hash sets, as trees, and so forth. In databases and search engines, sets are often an integral part of indexes. For example, we may need to maintain a set of all documents or rows (represented by numerical identifier) that satisfy some property. Besides adding or removing elements from the set, we need fast functions to compute the intersection, the union, the difference between sets, and so on.

To implement a set of integers, a particularly appealing strategy is the bitmap (also called bitset or bit vector). Using n bits, we can represent any set made of the integers from the range [0,n): the ith bit is set to one if integer i is present in the set. Commodity processors use words of W=32 or W=64 bits. By combining many such words, we can support large values of n. Intersections, unions and differences can then be implemented as bitwise AND, OR and ANDNOT operations. More complicated set functions can also be implemented as bitwise operations.

When the bitset approach is applicable, it can be orders of magnitude faster than other possible implementation of a set (e.g., as a hash set) while using several times less memory.

However, a bitset, even a compressed one is not always applicable. For example, if you have 1000 random-looking integers, then a simple array might be the best representation. We refer to this case as the "sparse" scenario.

When should you use compressed bitmaps?

An uncompressed BitSet can use a lot of memory. For example, if you take a BitSet and set the bit at position 1,000,000 to true and you have just over 100kB. That is over 100kB to store the position of one bit. This is wasteful even if you do not care about memory: suppose that you need to compute the intersection between this BitSet and another one that has a bit at position 1,000,001 to true, then you need to go through all these zeroes, whether you like it or not. That can become very wasteful.

This being said, there are definitively cases where attempting to use compressed bitmaps is wasteful. For example, if you have a small universe size. E.g., your bitmaps represent sets of integers from [0,n) where n is small (e.g., n=64 or n=128). If you can use uncompressed BitSet and it does not blow up your memory usage, then compressed bitmaps are probably not useful to you. In fact, if you do not need compression, then a BitSet offers remarkable speed.

The sparse scenario is another use case where compressed bitmaps should not be used. Keep in mind that random-looking data is usually not compressible. E.g., if you have a small set of 32-bit random integers, it is not mathematically possible to use far less than 32 bits per integer, and attempts at compression can be counterproductive.

How does Roaring compares with the alternatives?

Most alternatives to Roaring are part of a larger family of compressed bitmaps that are run-length-encoded bitmaps. They identify long runs of 1s or 0s and they represent them with a marker word. If you have a local mix of 1s and 0, you use an uncompressed word.

There are many formats in this family:

  • Oracle's BBC is an obsolete format at this point: though it may provide good compression, it is likely much slower than more recent alternatives due to excessive branching.
  • WAH is a patented variation on BBC that provides better performance.
  • Concise is a variation on the patented WAH. It some specific instances, it can compress much better than WAH (up to 2x better), but it is generally slower.
  • EWAH is both free of patent, and it is faster than all the above. On the downside, it does not compress quite as well. It is faster because it allows some form of "skipping" over uncompressed words. So though none of these formats are great at random access, EWAH is better than the alternatives.

There is a big problem with these formats however that can hurt you badly in some cases: there is no random access. If you want to check whether a given value is present in the set, you have to start from the beginning and "uncompress" the whole thing. This means that if you want to intersect a big set with a large set, you still have to uncompress the whole big set in the worst case...

Roaring solves this problem. It works in the following manner. It divides the data into chunks of 216 integers (e.g., [0, 216), [216, 2 x 216), ...). Within a chunk, it can use an uncompressed bitmap, a simple list of integers, or a list of runs. Whatever format it uses, they all allow you to check for the presence of any one value quickly (e.g., with a binary search). The net result is that Roaring can compute many operations much faster than run-length-encoded formats like WAH, EWAH, Concise... Maybe surprisingly, Roaring also generally offers better compression ratios.

References

  • Daniel Lemire, Owen Kaser, Nathan Kurz, Luca Deri, Chris O'Hara, François Saint-Jacques, Gregory Ssi-Yan-Kai, Roaring Bitmaps: Implementation of an Optimized Software Library, Software: Practice and Experience 48 (4), 2018 arXiv:1709.07821
  • Samy Chambi, Daniel Lemire, Owen Kaser, Robert Godin, Better bitmap performance with Roaring bitmaps, Software: Practice and Experience 46 (5), 2016.arXiv:1402.6407 This paper used data from http://lemire.me/data/realroaring2014.html
  • Daniel Lemire, Gregory Ssi-Yan-Kai, Owen Kaser, Consistently faster and smaller compressed bitmaps with Roaring, Software: Practice and Experience 46 (11), 2016. arXiv:1603.06549

Dependencies

Dependencies are fetched automatically by giving the -t flag to go get.

they include

  • github.com/bits-and-blooms/bitset
  • github.com/mschoch/smat
  • github.com/glycerine/go-unsnap-stream
  • github.com/philhofer/fwd
  • github.com/jtolds/gls

Note that the smat library requires Go 1.6 or better.

Installation

  • go get -t github.com/RoaringBitmap/roaring

Instructions for contributors

Using bash or other common shells:

$ git clone [email protected]:RoaringBitmap/roaring.git
$ export GO111MODULE=on 
$ go mod tidy
$ go test -v

Example

Here is a simplified but complete example:

package main

import (
    "fmt"
    "github.com/RoaringBitmap/roaring"
    "bytes"
)


func main() {
    // example inspired by https://github.com/fzandona/goroar
    fmt.Println("==roaring==")
    rb1 := roaring.BitmapOf(1, 2, 3, 4, 5, 100, 1000)
    fmt.Println(rb1.String())

    rb2 := roaring.BitmapOf(3, 4, 1000)
    fmt.Println(rb2.String())

    rb3 := roaring.New()
    fmt.Println(rb3.String())

    fmt.Println("Cardinality: ", rb1.GetCardinality())

    fmt.Println("Contains 3? ", rb1.Contains(3))

    rb1.And(rb2)

    rb3.Add(1)
    rb3.Add(5)

    rb3.Or(rb1)

    // computes union of the three bitmaps in parallel using 4 workers  
    roaring.ParOr(4, rb1, rb2, rb3)
    // computes intersection of the three bitmaps in parallel using 4 workers  
    roaring.ParAnd(4, rb1, rb2, rb3)


    // prints 1, 3, 4, 5, 1000
    i := rb3.Iterator()
    for i.HasNext() {
        fmt.Println(i.Next())
    }
    fmt.Println()

    // next we include an example of serialization
    buf := new(bytes.Buffer)
    rb1.WriteTo(buf) // we omit error handling
    newrb:= roaring.New()
    newrb.ReadFrom(buf)
    if rb1.Equals(newrb) {
    	fmt.Println("I wrote the content to a byte stream and read it back.")
    }
    // you can iterate over bitmaps using ReverseIterator(), Iterator, ManyIterator()
}

If you wish to use serialization and handle errors, you might want to consider the following sample of code:

	rb := BitmapOf(1, 2, 3, 4, 5, 100, 1000)
	buf := new(bytes.Buffer)
	size,err:=rb.WriteTo(buf)
	if err != nil {
		t.Errorf("Failed writing")
	}
	newrb:= New()
	size,err=newrb.ReadFrom(buf)
	if err != nil {
		t.Errorf("Failed reading")
	}
	if ! rb.Equals(newrb) {
		t.Errorf("Cannot retrieve serialized version")
	}

Given N integers in [0,x), then the serialized size in bytes of a Roaring bitmap should never exceed this bound:

8 + 9 * ((long)x+65535)/65536 + 2 * N

That is, given a fixed overhead for the universe size (x), Roaring bitmaps never use more than 2 bytes per integer. You can call BoundSerializedSizeInBytes for a more precise estimate.

64-bit Roaring

By default, roaring is used to stored unsigned 32-bit integers. However, we also offer an extension dedicated to 64-bit integers. It supports roughly the same functions:

package main

import (
    "fmt"
    "github.com/RoaringBitmap/roaring/roaring64"
    "bytes"
)


func main() {
    // example inspired by https://github.com/fzandona/goroar
    fmt.Println("==roaring64==")
    rb1 := roaring64.BitmapOf(1, 2, 3, 4, 5, 100, 1000)
    fmt.Println(rb1.String())

    rb2 := roaring64.BitmapOf(3, 4, 1000)
    fmt.Println(rb2.String())

    rb3 := roaring64.New()
    fmt.Println(rb3.String())

    fmt.Println("Cardinality: ", rb1.GetCardinality())

    fmt.Println("Contains 3? ", rb1.Contains(3))

    rb1.And(rb2)

    rb3.Add(1)
    rb3.Add(5)

    rb3.Or(rb1)



    // prints 1, 3, 4, 5, 1000
    i := rb3.Iterator()
    for i.HasNext() {
        fmt.Println(i.Next())
    }
    fmt.Println()

    // next we include an example of serialization
    buf := new(bytes.Buffer)
    rb1.WriteTo(buf) // we omit error handling
    newrb:= roaring64.New()
    newrb.ReadFrom(buf)
    if rb1.Equals(newrb) {
    	fmt.Println("I wrote the content to a byte stream and read it back.")
    }
    // you can iterate over bitmaps using ReverseIterator(), Iterator, ManyIterator()
}

Only the 32-bit roaring format is standard and cross-operable between Java, C++, C and Go. There is no guarantee that the 64-bit versions are compatible.

Documentation

Current documentation is available at https://pkg.go.dev/github.com/RoaringBitmap/roaring and https://pkg.go.dev/github.com/RoaringBitmap/roaring/roaring64

Goroutine safety

In general, it should not generally be considered safe to access the same bitmaps using different goroutines--they are left unsynchronized for performance. Should you want to access a Bitmap from more than one goroutine, you should provide synchronization. Typically this is done by using channels to pass the *Bitmap around (in Go style; so there is only ever one owner), or by using sync.Mutex to serialize operations on Bitmaps.

Coverage

We test our software. For a report on our test coverage, see

https://coveralls.io/github/RoaringBitmap/roaring?branch=master

Benchmark

Type

     go test -bench Benchmark -run -

To run benchmarks on Real Roaring Datasets run the following:

go get github.com/RoaringBitmap/real-roaring-datasets
BENCH_REAL_DATA=1 go test -bench BenchmarkRealData -run -

Iterative use

You can use roaring with gore:

  • go get -u github.com/motemen/gore
  • Make sure that $GOPATH/bin is in your $PATH.
  • go get github.com/RoaringBitmap/roaring
$ gore
gore version 0.2.6  :help for help
gore> :import github.com/RoaringBitmap/roaring
gore> x:=roaring.New()
gore> x.Add(1)
gore> x.String()
"{1}"

Fuzzy testing

You can help us test further the library with fuzzy testing:

     go get github.com/dvyukov/go-fuzz/go-fuzz
     go get github.com/dvyukov/go-fuzz/go-fuzz-build
     go test -tags=gofuzz -run=TestGenerateSmatCorpus
     go-fuzz-build github.com/RoaringBitmap/roaring
     go-fuzz -bin=./roaring-fuzz.zip -workdir=workdir/ -timeout=200 -func FuzzSmat

Let it run, and if the # of crashers is > 0, check out the reports in the workdir where you should be able to find the panic goroutine stack traces.

You may also replace -func FuzzSmat by -func FuzzSerializationBuffer or -func FuzzSerializationStream.

Alternative in Go

There is a Go version wrapping the C/C++ implementation https://github.com/RoaringBitmap/gocroaring

For an alternative implementation in Go, see https://github.com/fzandona/goroar The two versions were written independently.

Mailing list/discussion group

https://groups.google.com/forum/#!forum/roaring-bitmaps

roaring's People

Contributors

alldroll avatar anacrolix avatar askalexsharov avatar betrok avatar bpot avatar cornelk avatar damnever avatar davidchen-cn avatar dgryski avatar fzerorubigd avatar gamolina avatar glycerine avatar guymolinari avatar jacobmarble avatar jnewhouse avatar joenall avatar kevinconaway avatar lemire avatar maciej avatar marreck avatar neena avatar oppen avatar richardartoul avatar statik avatar steveyen avatar testwill avatar tgruben avatar tsenart avatar ttsugriy avatar willglynn avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

roaring's Issues

Add benchmarks

In theory, adding benchmarks in Go is trivial. It is as easy as adding a test.

Unfortunately, Convey seems to get in the way: whenever I try to run a single benchmark, Convey insists on running all the tests, which takes 2 minutes or so...

I do not know Convey well enough to understand how well it supports benchmarks.

To understand what I have in mind, grab https://github.com/willf/bitset and try...

  go test -bench=Iterate

It will run a single benchmark and nothing else. This very handy when optimizing the code. I cannot make this work with Convey.

runContainer16 efficiency: avoid O(N^2) in runContainer16 interval subtraction

https://github.com/RoaringBitmap/roaring/pull/70/files#diff-fd38fd3c1b1dffc8f36af21de50228f5R1204

line 1203 of rle16.go, at the top of isubtract() I make a copy of the backing slice rc.iv:

func (rc *runContainer16) isubtract(del interval16) {
	origiv := make([]interval16, len(rc.iv))
	copy(origiv, rc.iv)
       ...

The copy of rc.iv into origiv is needed for correctness, because subtracting an interval can actually split an existing interval in two and make the overall slice grow.

Hence under the current implementation using a backing slice of intervals, a pathological subtraction of a set of intervals could cause O(N^2) time. For large sets a red-black tree, e.g. https://github.com/glycerine/rbtree, or converting to a bitmapContainer first and doing the subtraction there, is probably faster; but needs to be measured.

Issue with AddMany

Hello,

I think i've found a bug in AddMany & BitmapOf methods.
Here is a small test:

package libtest

import (
	"github.com/RoaringBitmap/roaring"
	"testing"
)

var array = []uint32{5580, 33722, 44031, 57276, 83097}

func TestRoaringBitmapBitmapOf(t *testing.T) {

	bmp := roaring.BitmapOf(array...)
	if len(array) != int(bmp.GetCardinality()) {
		t.Errorf("length diff %d!=%d", len(array), bmp.GetCardinality())
		t.FailNow()
	}
}

func TestRoaringBitmapAdd(t *testing.T) {

	bmp := roaring.NewBitmap()
	for _, v := range array {
		bmp.Add(v)
	}
	if len(array) != int(bmp.GetCardinality()) {
		t.Errorf("length diff %d!=%d", len(array), bmp.GetCardinality())
		t.FailNow()
	}
}

func TestRoaringBitmapAddMany(t *testing.T) {

	bmp := roaring.NewBitmap()
	bmp.AddMany(array)
	if len(array) != int(bmp.GetCardinality()) {
		t.Errorf("length diff %d!=%d", len(array), bmp.GetCardinality())
		t.FailNow()
	}
}

Result:

--- FAIL: TestRoaringBitmapBitmapOf (0.00s)
	roaring_test.go:17: length diff 5!=4
--- FAIL: TestRoaringBitmapAddMany (0.00s)
	roaring_test.go:39: length diff 5!=4
FAIL

new rangeOnfOnes implementation induces multiple test suite failures

when roaringarray.go's func rangeOfOnes(start, last int) container is changed to be implemented purely with bitmaps, arrays, or rle16 containers, I see test suite failure. Here are the 3 options:

For context: this

https://github.com/RoaringBitmap/roaring/blob/master/roaringarray.go#L42

is being replaced by:

 func rangeOfOnes(start, last int) container {
       return newBitmapContainerwithRange(start, last) // option 1
       //return newArrayContainerRange(start, last) // option 2
       //return newRunContainer16Range(uint64(start), uint64(last)) // option3
 }

as in glycerine#1

and here are the suite fails, which are different in each case:

bitmaps for runs, test failures:

func TestDoubleAndNotBug01(t *testing.T)  fails, line 1681

Failures:

  * /Users/jaten/go/src/github.com/glycerine/roaring/roaring_test.go 
  Line 1681:
  Expected: true
  Actual:   false

----------

arrays for runs: 2 distinct fails:

Failures:

  * /Users/jaten/go/src/github.com/glycerine/roaring/roaring_test.go 
  Line 1616:
  Expected: 'true'
  Actual:   'false'
  (Should be equal)


--- FAIL: TestDoubleAdd2 (0.01s)

Failures:

  * /Users/jaten/go/src/github.com/glycerine/roaring/roaring_test.go 
  Line 1629:
  Expected: 'true'
  Actual:   'false'
  (Should be equal)



--- FAIL: TestDoubleAdd3 (0.01s)

----------
with rle16:


Failures:

  * /Users/jaten/go/src/github.com/glycerine/roaring/roaring_test.go 
  Line 787:
  Expected: '96617'
  Actual:   '9'
  (Should be equal)



Failures:

  * /Users/jaten/go/src/github.com/glycerine/roaring/roaring_test.go 
  Line 865:
  Expected: '35392'
  Actual:   '1928'
  (Should be equal)

  * /Users/jaten/go/src/github.com/glycerine/roaring/roaring_test.go 
  Line 886:
  Expected: '131999'
  Actual:   '131351'
  (Should be equal)

Failures:

  * /Users/jaten/go/src/github.com/glycerine/roaring/roaring_test.go 
  Line 933:
  Expected: 'true'
  Actual:   'false'
  (Should be equal)

Failures:

  * /Users/jaten/go/src/github.com/glycerine/roaring/roaring_test.go 
  Line 1589:
  Expected: 'true'
  Actual:   'false'
  (Should be equal)


--- FAIL: TestFlipBigA (0.12s)

Parallelize

Roaring is ideally suited for parallel execution, as all containers can be operated upon in distinct threads.

Implement run containers

For better compression when there are long runs of consecutive values, the Java version has run containers (as of version 0.5). They should be implemented :

https://github.com/lemire/RoaringBitmap/blob/master/src/main/java/org/roaringbitmap/RunContainer.java

These can be created by the "runOptimize" function...

https://github.com/lemire/RoaringBitmap/blob/master/src/main/java/org/roaringbitmap/RoaringBitmap.java#L1286

These containers can improve the compression ratio in some cases.

Operations like Or fail race checks

In practice this may not be dangerous, but concurrently performing operations like Or(rb1, rb2) triggers warnings in Go's race detector due to concurrently writing the roaringarray.needCopyOnWrite flags on the argument bitmaps.

Example test:

func TestConcurrentBitmapAccess(t *testing.T) {
	rb1 := roaring.BitmapOf(0, 1, 10)

	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			rb2 := roaring.NewBitmap()
			rb2.Or(rb1)
			wg.Done()
		}()
	}
	wg.Wait()
}

Run via go test -race and you should see race warnings like:

==================
WARNING: DATA RACE
Write at 0x00c4201436e8 by goroutine 9:
  github.com/figly/datasearch/vendor/github.com/RoaringBitmap/roaring.(*roaringArray).appendCopyMany()
      /Users/verm/go/src/github.com/figly/datasearch/vendor/github.com/RoaringBitmap/roaring/roaringarray.go:91 +0x3f6
  github.com/figly/datasearch/vendor/github.com/RoaringBitmap/roaring.Or()
      /Users/verm/go/src/github.com/figly/datasearch/vendor/github.com/RoaringBitmap/roaring/roaring.go:686 +0xa73
  github.com/figly/datasearch/vendor/github.com/RoaringBitmap/roaring.(*Bitmap).Or()
      /Users/verm/go/src/github.com/figly/datasearch/vendor/github.com/RoaringBitmap/roaring/roaring.go:578 +0x46
  github.com/figly/datasearch/figsearch.TestConcurrentBitmapAccess.func1()
      /Users/verm/go/src/github.com/figly/datasearch/figsearch/codec_test.go:67 +0x2e5

Previous write at 0x00c4201436e8 by goroutine 8:
  github.com/figly/datasearch/vendor/github.com/RoaringBitmap/roaring.(*roaringArray).appendCopyMany()
      /Users/verm/go/src/github.com/figly/datasearch/vendor/github.com/RoaringBitmap/roaring/roaringarray.go:91 +0x3f6
  github.com/figly/datasearch/vendor/github.com/RoaringBitmap/roaring.Or()
      /Users/verm/go/src/github.com/figly/datasearch/vendor/github.com/RoaringBitmap/roaring/roaring.go:686 +0xa73
  github.com/figly/datasearch/vendor/github.com/RoaringBitmap/roaring.(*Bitmap).Or()
      /Users/verm/go/src/github.com/figly/datasearch/vendor/github.com/RoaringBitmap/roaring/roaring.go:578 +0x46
  github.com/figly/datasearch/figsearch.TestConcurrentBitmapAccess.func1()
      /Users/verm/go/src/github.com/figly/datasearch/figsearch/codec_test.go:67 +0x2e5

Goroutine 9 (running) created at:
  github.com/figly/datasearch/figsearch.TestConcurrentBitmapAccess()
      /Users/verm/go/src/github.com/figly/datasearch/figsearch/codec_test.go:69 +0x115
  testing.tRunner()
      /usr/local/Cellar/go/1.7.1/libexec/src/testing/testing.go:610 +0xc9

Goroutine 8 (finished) created at:
  github.com/figly/datasearch/figsearch.TestConcurrentBitmapAccess()
      /Users/verm/go/src/github.com/figly/datasearch/figsearch/codec_test.go:69 +0x115
  testing.tRunner()
      /usr/local/Cellar/go/1.7.1/libexec/src/testing/testing.go:610 +0xc9
==================

A workaround is to call rb.Clone() before using in any other operations in a separate goroutine.

Possible solutions:

  1. Leave as is but clarify this behavior in documentation. The current documentation gives me the impression that reading bitmaps in concurrent goroutines should be safe if I haven't set them to be copy-on-write. The race warning suggests otherwise. This may be wrong (the racing goroutines are setting the memory to the same value) but at minimum it's worrying and some doc mention would be helpful.

  2. Use atomic loads/stores when accessing that flag. At minimum using an atomic store to implement roaringarray.setNeedsCopyOnWrite fixes this particular race for me; I'm not sure if other accesses of the flag need similar treatment.

  3. Have operations like Or use appendWithoutCopy and appendWithoutCopyMany if the argument bitmaps aren't copy-on-write. (I haven't thought about what the performance indications are here.)

Optimize the implementation of OrCardinality

Currently, the OrCardinality function materializes temporary containers that are immediately (?) garbage collected. This is wasteful. Instead, we should have fast container-to-container union functions that just return the cardinality without doing memory allocation, as is the case with the intersections.

// TODO: could be faster if we did not have to materialize the container

c.c. @tgruben @statik @bpot @tvmaly

thoughts on serialization formats

Thoughts on serialization... things that could save a bunch of time...

When I need quick and portable serialization, I typically use msgpack2 from http://msgpack.org. It's a binary version of JSON that has a readable spec and is portable to all languages.

A reason I like msgpack2: there is an easy to use library for Go that is very fast.
Fast to deploy and fast when reading/writing.

https://github.com/tinylib/msgp in particular will parse your .go files and do codegen for your Go structs, making setting up serialization a 30 second task. All that is required is that your struct fields be Exported (capitalized).

Usually compression isn't needed afterwards, depending on the entropy of your data. You have to measure, but sometimes it is a win. If desired, then snappy can be easily applied without sacrificing decoding speed. Other general purpose compressors like gzip and bzip will sometimes compress slightly more, but are super slow to decode, hogging cpu and wasting valuable time at runtime. I usually use my implementation of the streaming snappy format here, https://github.com/glycerine/go-unsnap-stream, as it comes with nice command line tools (snap and unsnap). Since Google open sourced snapped (http://google.github.io/snappy/) quite a while ago, there are libraries for most languages for this as well.

Finally, one other nice thing about msgpack2 is that you don't need to predefine a schema; like JSON the data is self describing. It gives you some data-evolution capability (based of field names rather than field numbers) while being dynamic language friendly. There is no need for a pre-compiler (as in protobufs/thrift/capnproto) and java and python can read msgpack2 with the available libraries. msgpack2 is quite a bit more strongly typed that JSON however, and you can quickly move binary data without a whole lot of base64 encoding fuss.

That said, if one really wants a schema for future proofing the data evolution, then protobufs are also highly tuned and have libraries for the big 4 (Go, C++, Java, python). Also I used to maintain a capnproto implementation for Go, so I can discuss that if it is of interest. However the java bindings were never finished, so I didn't really consider capnproto a choice here where java support is expected.

AddRange pattern that panics

This sequence of calls causes a panic on linux/arm and darwin/amd64

bm := NewRoaringBitmap()
bm.AddRange(21, 26)
bm.AddRange(9, 14)
bm.AddRange(11, 16)

Flip on an empty bitmap does not set bits

package main

import (
"fmt"
"github.com/roaring"
)

func main() {
bm := roaring.NewRoaringBitmap()
roaring.Flip(bm, 0, 10)
fmt.Printf("%v cardinality is: %d\n", bm, bm.GetCardinality())
}

Replace short by uint16

In Java, we are stuck with short, int, long. The Java roaring library effectively emulates uint16 using short (a major pain). However, go has exactly the right native type (uint16).

Unless there is a very good reason for it, I recommend that the type "short" be replaced by uint16 for clarity.

(I could make this change, naturally... but I try to avoid doing such far ranging changes without discussing it first.)

Q: recommended way to intersect with a contiguous interval/time span?

A common filter, used in 90% of my queries, is to restrict a document set to a [begin,end) interval or time-window.

If I was to construct document IDs in chronological order, then this filter can be efficiently represented by a [first, last) interval of two integers.

Q: what is the efficient way to represent/construct such a bitset in Roaring, and use it in intersections?

Wrinkle: I typically have 100e6 documents, so it seems super inefficient (but please correct me if I'm wrong) to do the following:

func CreateContinguousSpanBitmap(begTime, endTime uint64, chronoOrderedDocs []Doc) *roaring.Bitmap {
 b := roaring.NewBitmap()
 n := len(chronoOrderedDocs)
 for i:=uint32(0); i <  n; i++ {
   b.Add(i)
 }
 b.RemoveRange(begTime, endTime) 
 b.Flip(0, n)
 return b
}

There's got to be a faster way, no?

use arrays for dense blocks

the cython implementation by @andreasvc has a nice idea from lucene "that it uses arrays not only when a block contains less than 2 ** 12 elements, but also when it contains more than 2 ** 32 - 2 ** 12 elements; i.e., blocks that are mostly full are stored just as compactly as blocks that are mostly empty"

it would be nice to support that in this version.

possible Flip() bug?

Hi, thanks for the cool work on roaring bitmaps.

Here's something strange that I originally encountered while trying to write some automated fuzz testing for roaring.

It's possible I'm using the Flip() API incorrectly, or the bug is really in the willf.BitSet that I'm comparing roaring against, but please see here for a simple test...

https://github.com/steveyen/roaring/blob/master/roaring_test.go#L1685

        bm := roaring.NewBitmap()
        bs := bitset.New(0)

        bm.AddInt(0)
        bs = bs.Set(0)

        bm.Flip(1, 2)
        bs = bs.Flip(1)

        So(equalsBitSet(bs, bm), ShouldEqual, true) // Fails here.
}

How to count all set bits in a bitmap before an index?

Hi,

just wanted to ask for some advice before trying the code out. Say I have a Roaring Bitmap which can contain any unsigned integer below 4,294,967,296. It will be very sparsely populated and the highest integer may be much lower than the maximum. I want to count the set bits below an index. Is the recommended solution to make a new RoaringBitmap of length equal to the index-1 and AND it with the main bitmap and then do a count of the resulting bitmap?

Any advice much appreciated on this, or any other approach. Performance will be the key factor for my use case.

Thanks!

Rename repo

You have named the package goroaring. I approve even though the simpler "roaring" could be even better.

There is a real problem however with the repo name (go-roaring) on github. This violates one of the Go's convention that the name of the repository should match the package name:

http://golang.org/doc/effective_go.html#package-names

tune for performance

often times in the run container operations it appeared faster to convert to bitmaps and the do the operations there. Meaure and see if any of the remaining operations could benefit from this kind of tuning.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.