Comments (8)

lemire commented on August 27, 2024

Yes, if you make all the necessary adjustments, in principle, it can be made to work.

from simdcompressionandintersection.

kreuzerkrieg commented on August 27, 2024

It looks like it does not work. I guess the output would have to be identical in both libraries for each to decode the other's data correctly, and it is not. The Java output is longer by one int, and the first int differs between the two. The rest of the data is identical up to some point, after which different data is written to the compressed int array. I will post the code right away.

kreuzerkrieg commented on August 27, 2024

The dataset

std::vector<uint32_t> data = {
        99,    236,   566,   784,   928,   943,   1103,  1204,  2013,  2118,  2518,  3052,  3304,  3491,  3812,  4622,
        4817,  5085,  7335,  7352,  8162,  8365,  8573,  9060,  9084,  9945,  11367, 11436, 11512, 12375, 13405, 13704,
        16028, 16393, 16695, 17149, 18852, 19248, 19445, 19476, 21367, 22801, 23869, 24285, 25875, 26057, 26138, 26274,
        27303, 27325, 27808, 28150, 28519, 28630, 29092, 29411, 29822, 30231, 30402, 30654, 31646, 32091, 32469, 32476,
        32482, 32786, 32829, 33088, 33215, 34719, 34912, 35337, 35768, 35816, 35852, 36164, 37084, 37655, 37856, 38032,
        38288, 38317, 38564, 38893, 39037, 39197, 39370, 40314, 40538, 40595, 41353, 41424, 41973, 42053, 42484, 42868,
        44207, 44215, 44606, 44658, 45976, 46105, 46242, 47275, 48396, 48416, 48858, 49025, 49269, 49890, 50098, 50699,
        50844, 50986, 51173, 51780, 52398, 52925, 53262, 53357, 53667, 53876, 54104, 54875, 55119, 55785, 55819, 55914,
        56251, 56681, 56754, 57497, 57535, 57773, 58726, 58910, 58957, 59266, 59676, 59704, 60111, 60636, 61861, 62082,
        62099, 62446, 62531, 62575, 62706, 63589, 65319, 65347, 65608, 66145, 66250, 66675, 66989, 67258, 68044, 68259,
        68666, 68776, 69301, 71063, 71104, 71303, 72143, 72169, 72799, 73686, 73923, 74174, 74477, 75505, 76062, 77501,
        77926, 78299, 78455, 79760, 80431, 80661, 80820, 82402, 82854, 82874, 82979, 83091, 83098, 83305, 85269, 85594,
        85714, 86116, 86517, 86594, 86751, 87527, 88535, 88998, 89834, 89887, 93196, 93341, 93720, 93925, 94205, 94673,
        95210, 95234, 95855, 96505, 97246, 97347, 97677, 98713, 98755, 98910, 99097, 99791};

The same dataset is used in Java, where the data variable is declared as int[] data = {...};
The C++ code

 auto codec = CODECFactory::getFromName("bp32");
 std::vector<uint32_t> compressed_output;
 compressed_output = codec->compress(data);

The Java code

IntegratedIntCompressor codec = new IntegratedIntCompressor();
int[] compressed;
compressed = codec.compress(data);
// just for inspection purposes: view each int as an unsigned value
long[] ints = new long[compressed.length];
for (int i = 0; i < compressed.length; ++i) {
    ints[i] = Integer.toUnsignedLong(compressed[i]);
}

If you inspect compressed_output in C++ and ints in Java, you can see it has 192 and 220 values respectively. Up to the 49th position the data is the same; from the 49th position on it differs. The C++ vector has 83 elements and the Java array has 84 elements.
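
A comparison like the one described above can be automated instead of done by eye. The following is a minimal sketch, assuming both compressed buffers have already been loaded into std::vector<uint32_t> (for the Java side, after the same unsigned conversion as in the Java snippet). The names first_difference and kNoDifference are hypothetical helpers introduced here for illustration; they are not part of either library.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sentinel: returned when the two buffers are identical.
constexpr std::size_t kNoDifference = static_cast<std::size_t>(-1);

// Returns the index of the first position where the two buffers disagree.
// If one buffer is a strict prefix of the other, the result is the length
// of the shorter buffer. Returns kNoDifference when they are identical.
std::size_t first_difference(const std::vector<uint32_t>& a,
                             const std::vector<uint32_t>& b) {
    // Only the overlapping prefix can be compared element by element.
    const std::size_t common = std::min(a.size(), b.size());
    const auto end_a = a.begin() + static_cast<std::ptrdiff_t>(common);
    const auto it = std::mismatch(a.begin(), end_a, b.begin()).first;
    const std::size_t index = static_cast<std::size_t>(it - a.begin());
    if (index == common && a.size() == b.size()) {
        return kNoDifference;
    }
    return index;
}
```

Calling first_difference(cpp_output, java_output) on the two buffers would report the first index at which they diverge, which is a quicker check than inspecting both arrays in a debugger. It only shows where the formats differ, not why.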

lemire commented on August 27, 2024

Sure. The output won't be identical, you need to make all the necessary adjustments. Lots of details may differ.

There is no claim to bit-by-bit format compatibility. If one needs this, then it needs to be engineered, tested and so forth.

If it is important to you, your code contributions would be welcome.

kreuzerkrieg commented on August 27, 2024

Actually, I don't care if the output is different, as long as it can be decoded correctly on both sides.
OK, so let me ask another question: should the Java output, which differs from the C++ one, be decodable by the C++ uncompress and return the same dataset? That is not what I observe. When I try to decode in C++ data loaded from a file that was written by Java, it fails with an exception from checkifdivisibleby.

lemire commented on August 27, 2024

I don't think that there is any claim whatsoever that they are compatible, no. The code was written by different people at different times without any attempt at format compatibility. Why would there be?

It is essentially the same algorithm, so it is practical to make them compatible but this requires some programming effort. Your contributions are invited.

lemire commented on August 27, 2024

If you are interested in Java-C++ compatibility, we have Masked VByte which is explicitly format compatible with Lucene's vInt format:

http://maskedvbyte.org/

kreuzerkrieg commented on August 27, 2024

OK, now it is much clearer: it might work, but it is not guaranteed. Thanks for the link.
Maybe I will just wrap the C++ implementation with JNI; that looks like the easiest solution.
