streamly-lz4's Introduction

streamly-lz4

This library uses the LZ4 compression algorithm (https://github.com/lz4/lz4) to compress and decompress data streams using Haskell Streamly.
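
As a rough usage sketch (module names follow streamly 0.8 and the combinator names follow the discussion in the issues further down; the exact signatures are assumptions, so check the package documentation):

import Data.Function ((&))
import Data.Word (Word8)
import qualified Streamly.Prelude as Stream
import qualified Streamly.Data.Array.Foreign as Array
import Streamly.LZ4 (compress, decompress)

-- Round-trip a byte list: chunk into 64K arrays, compress with an assumed
-- acceleration value of 1, decompress, and flatten back into bytes.
roundTrip :: [Word8] -> IO [Word8]
roundTrip bytes =
    Stream.fromList bytes
        & Stream.chunksOf 65536 Array.write
        & compress 1      -- acceleration value; signature is an assumption
        & decompress
        & Stream.unfoldMany Array.read
        & Stream.toList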

Running benchmarks

download-corpora.sh downloads and unpacks the Canterbury corpus used for benchmarking.

Run the following commands from the top level directory in the repo:

$ ./download-corpora.sh
$ cabal run bench-lz4 -- --quick

To use fusion-plugin for benchmarks:

$ cabal run -f fusion-plugin bench-lz4 -- --quick

Benchmarking an external corpus

BENCH_STREAMLY_LZ4_FILE is looked up for the file to benchmark. It should contain the absolute path of the target file.

BENCH_STREAMLY_LZ4_STRATEGY is looked up for which benchmarking combinator to run on the target file.

It can take values of the following forms:

  • c+speed+bufsize
  • d+bufsize
  • r+bufsize

Here c corresponds to compression, d to decompression, and r to resizing.

Example,

$ export BENCH_STREAMLY_LZ4_FILE="path/to/file/"
$ export BENCH_STREAMLY_LZ4_STRATEGY="c+400+640000"
$ cabal bench

The commands above run the compression bench suite on path/to/file/ with an acceleration value of 400, reading the file as arrays of 640000 bytes each.

Note: For the decompression and resizing bench suites, a compressed file is expected as input.

streamly-lz4's People

Contributors

adithyaov, harendra-kumar

streamly-lz4's Issues

Improvements to benchmarks

  1. There are too many benchmarks; it is difficult to analyze the results because of the noise. We should keep only those that give us useful information:
    a. Remove the compress.decompress benchmarks. The usual use case is to either compress a stream or decompress a stream, so it should be enough to measure these two separately.
    b. We can keep one decompressResized benchmark to compare how much difference resizing makes; we do not need these variants of all benchmarks. Most of the time we would be resizing anyway.
    c. We can run one benchmark with multiple acceleration values to compare how much difference acceleration makes. We should not run all benchmark variants with all acceleration values.
  2. A 64K block size would probably be optimal because the dictionary is at most 64K. We can use a small block size (64 bytes), 64K, and a big block size (1 MB) for comparison. We do not need to run all benchmarks with all buffer sizes; running one benchmark with all buffer sizes is enough.
  3. Make all the input files of equal size so that we can compare the benchmarks. We can replicate the data to make them all 5MB or 10MB.
  4. Add a way to benchmark external files, like we have in streamly FileSystem.Handle benchmarks.

Provide support to read/write lz4 frames

See this document for the LZ4 frame format: https://github.com/lz4/lz4/blob/dev/doc/lz4_Frame_format.md .

We can write a parser (a terminating fold) to decode a stream of frames into a stream of compressed blocks: decode the frame header and then decompress appropriately using the options/flags it carries. Similarly, we can write a serializer to emit a stream of compressed blocks as an LZ4 frame using options provided by the user. A sketch of decoding the frame header is shown after the list below.

Also, we have to handle the following in the block headers:

  • the uncompressed bit in the block header. Currently we error out in that case.
  • blocks with a block checksum following the block data
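
As a reference point, the fixed part of the frame header can be decoded with plain list processing. The field layout below is taken from the frame format document; the names and types are only an illustrative sketch, not the library's code:

import Data.Bits (shiftR, testBit)
import Data.List (stripPrefix)
import Data.Word (Word8)

-- LZ4 frame magic number 0x184D2204, stored little-endian.
frameMagic :: [Word8]
frameMagic = [0x04, 0x22, 0x4D, 0x18]

data FrameFlags = FrameFlags
    { blockIndependence :: Bool  -- FLG bit 5
    , blockChecksum     :: Bool  -- FLG bit 4
    , hasContentSize    :: Bool  -- FLG bit 3
    , contentChecksum   :: Bool  -- FLG bit 2
    , hasDictId         :: Bool  -- FLG bit 0
    } deriving Show

-- Check the magic number and decode the FLG byte; the remaining input starts
-- at the BD byte (optional content size, dictionary id and header checksum follow).
decodeFrameHeader :: [Word8] -> Maybe (FrameFlags, [Word8])
decodeFrameHeader input = do
    rest <- stripPrefix frameMagic input
    case rest of
        flg : rest' | flg `shiftR` 6 == 1 ->  -- version bits (7-6) must be 01
            Just ( FrameFlags (testBit flg 5) (testBit flg 4)
                              (testBit flg 3) (testBit flg 2) (testBit flg 0)
                 , rest' )
        _ -> Nothing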

Specialized INLINE pragmas

In streamly, we use the following CPP helpers,

#define INLINE_EARLY  INLINE [2]
#define INLINE_NORMAL INLINE [1]
#define INLINE_LATE   INLINE [0]

This does not work well with a .hsc file, and it might not even be the right way to define these helpers in a .hsc file.
The closest facility I can see in the documentation is #let.

This code is currently commented out but we should figure out a way to write informative INLINE pragmas.
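
One possible (untested) shape for this using #let, which defines a macro whose body is printf-style output that hsc2hs splices into the generated Haskell source:

#let inline_early  f = "{-# INLINE [2] %s #-}", #f
#let inline_normal f = "{-# INLINE [1] %s #-}", #f
#let inline_late   f = "{-# INLINE [0] %s #-}", #f

-- used in the .hsc file as, for example:
#{inline_normal compressChunk}

This is only a guess at the syntax; it needs to be tried out against the actual .hsc sources.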

Implementation plan

  • Initialize the project
  • Update the license in LICENSE
  • Create a README.md with the required info
    • Use place holders for Installation and Benchmarks
  • Create a Changelog
  • Create scaffolding for tests
  • Create scaffolding for benchmarks
  • Write the API with undefined implementation
  • Add minimal documentation to the API
  • Import any C code required and update the credits accordingly
  • Implement the API
  • Write basic tests for the API
  • Write real world tests for the API
  • Write real world benchmarks for the API
  • Check for passing tests
  • [ ] Compare benchmarks with native implementation
    • Performance related tasks can be done later
  • Check for proper naming and module structure
  • Clean the code
  • Add some implementation notes
  • Update the README.md
  • Add CI
  • Make any final changes
  • Clean the commits
  • Get the code reviewed

Use specific type cast for all uses of fromIntegral

For example, like this:

              else if ((fromIntegral :: Int32 -> Int) srcLen < Array.byteLength arr)

It is easier to verify the safety, and it forces us to think about whether the cast is safe. Always try to upcast a smaller type to a larger one rather than the other way around.

Possibility of leaking `Ptr C_LZ4Stream` ?

https://github.com/composewell/streamly-lz4/blob/a929c20fb582da95783f84a48dc174204cb8601d/src/Streamly/Internal/LZ4.hs#L394C56-L394C56

In the code above, Ptr C_LZ4Stream would be freed ONLY after the following path (in reverse order):

  • c_freeStream is called
  • the stream enters the CompressDone ctx state
  • Stream.Stop is received from the inner stream (the only way to produce CompressDone ctx)

Does that mean we'll get a memory leak in code like:

readFile f & compressChunks cfg speed & take 0 & fold drain :: IO ()

since we'll never get Stream.Stop from readFile? Or am I missing another path that would clean up the Ptr C_LZ4Stream properly?

Use specific endianness for block headers

We should use a specific byte ordering instead of the machine byte ordering; otherwise the code will not work across machines with different byte orderings. We can raise an issue for this and address it separately.
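
For example, the 32-bit compressed-block length could be written and read explicitly as little-endian, independent of the host byte order (illustrative helpers, not the library's current code):

import Data.Bits (shiftL, shiftR, (.|.))
import Data.Word (Word8, Word32)

-- Serialize a 32-bit block length as 4 bytes, least significant byte first.
encodeLE32 :: Word32 -> [Word8]
encodeLE32 w = [fromIntegral (w `shiftR` s) | s <- [0, 8, 16, 24]]

-- Read the 4 bytes back in the same order on any machine.
decodeLE32 :: Word8 -> Word8 -> Word8 -> Word8 -> Word32
decodeLE32 b0 b1 b2 b3 =
      fromIntegral b0
  .|. fromIntegral b1 `shiftL` 8
  .|. fromIntegral b2 `shiftL` 16
  .|. fromIntegral b3 `shiftL` 24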

Unreproducible GHC 8.8.4 error

The CI for GHC 8.8.4 on Linux failed once. Unfortunately, the logs were not recorded and I don't know the right conditions to reproduce the error.

It was probably a memory issue but I'm not sure.

Benchmark names are too long

We should not use the whole absolute path of the corpus file in the benchmark name.

compress/files/bufsize(65536)/compress 5//home/harendra/composewell/streamly-lz4/corpora/large/bible.txt.normalized time                 41.77 ms

compress/files/bufsize(65536)/compress 5//home/harendra/composewell/streamly-lz4/corpora/large/world192.txt.normalized time                 33.97 ms

compress/files/bufsize(65536)/compress 5//home/harendra/composewell/streamly-lz4/corpora/cantrbry/alice29.txt.normalized time                 39.86 ms

Write compressChunk and decompressChunk as folds

compressChunk :: Int -> Fold m (Array.Array Word8) (Array.Array Word8)

-- Accept properly resized chunks
decompressChunk :: Fold m (Array.Array Word8) (Array.Array Word8)

We can then implement compress and decompress in terms of these folds, as sketched below.
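
Assuming each fold terminates after consuming one chunk, the stream-level combinators could be a simple foldMany over them (a sketch, not the actual implementation):

-- Repeatedly run the terminating fold over the input stream of chunks,
-- emitting one output array per fold run.
compress :: MonadIO m => Int -> SerialT m (Array.Array Word8) -> SerialT m (Array.Array Word8)
compress speed = Stream.foldMany (compressChunk speed)

decompress :: MonadIO m => SerialT m (Array.Array Word8) -> SerialT m (Array.Array Word8)
decompress = Stream.foldMany decompressChunk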

Proposed compress and decompress APIs

Compress:

  • compress: A convenience API that uses a default speedup factor and a default block size for compression
  • compressWith: ability to specify speedup and block size
  • compressRaw: compress the arrays in the input stream as is without resizing

Decompress:

  • decompress: resize the input arrays and decompress
  • decompressRaw: assume the arrays are of the right size already
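
In type-signature form the proposal would look roughly like this (the speedup factor and block size are shown as plain Ints; the actual representation is open):

compress     :: MonadIO m => SerialT m (Array Word8) -> SerialT m (Array Word8)
compressWith :: MonadIO m => Int -> Int -> SerialT m (Array Word8) -> SerialT m (Array Word8)
                -- speedup factor and block size
compressRaw  :: MonadIO m => Int -> SerialT m (Array Word8) -> SerialT m (Array Word8)

decompress    :: MonadIO m => SerialT m (Array Word8) -> SerialT m (Array Word8)
decompressRaw :: MonadIO m => SerialT m (Array Word8) -> SerialT m (Array Word8)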

Rename the APIs

The compress/decompress APIs work on chunks/arrays. All the APIs in streamly that work on arrays are suffixed with Chunks. We should use the same convention here as well:

  • compressChunks: current compress
  • decompressChunks: current decompress
  • compress: compress a stream of Word8 to a stream of Word8
  • decompress: decompress a stream of Word8 consisting of compressed chunks

Try stream to stream compression/decompression APIs

For decompression, we could flatten the compressed input arrays into a stream and parse the stream into compressed chunks using a terminating fold. Check how this performs in comparison to array resizing.

For compression we could chunk the stream into arrays of 64K by default and provide an option to change the default.
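
A sketch of the compression side, assuming a chunk-level combinator (compressChunks here) that takes an acceleration value, with the 64K default exposed as a parameter:

-- Group the byte stream into arrays of at most blockSize bytes and compress
-- each array; blockSize would default to 64K (65536).
compressStream :: MonadIO m => Int -> Int -> SerialT m Word8 -> SerialT m (Array Word8)
compressStream blockSize speed =
    compressChunks speed . Stream.chunksOf blockSize Array.write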

We can provide both array as well as stream based APIs if arrays show some perf advantage otherwise we can keep just stream based ones.

Issue with fusion

All combinators fuse if they are the only combinators in the pipeline.
Adding any other combinator breaks fusion.

The problem seems to be with stuff not getting inlined.
