streamly-lz4's Introduction

streamly-lz4

This library uses the LZ4 compression algorithm (https://github.com/lz4/lz4) to compress and decompress data streams using Haskell Streamly.
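
As a rough usage sketch (module names follow streamly 0.8 and the combinator names follow the discussion in the issues further down; the exact signatures are assumptions, so check the package documentation):

import Data.Function ((&))
import Data.Word (Word8)
import qualified Streamly.Prelude as Stream
import qualified Streamly.Data.Array.Foreign as Array
import Streamly.LZ4 (compress, decompress)

-- Round-trip a byte list: chunk into 64K arrays, compress with an assumed
-- acceleration value of 1, decompress, and flatten back into bytes.
roundTrip :: [Word8] -> IO [Word8]
roundTrip bytes =
    Stream.fromList bytes
        & Stream.chunksOf 65536 Array.write
        & compress 1      -- acceleration value; signature is an assumption
        & decompress
        & Stream.unfoldMany Array.read
        & Stream.toList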

Running benchmarks

download-corpora.sh downloads and unpacks the Canterbury corpus used for benchmarking.

Run the following commands from the top level directory in the repo:

$ ./download-corpora.sh
$ cabal run bench-lz4 -- --quick

To use fusion-plugin for benchmarks:

$ cabal run -f fusion-plugin bench-lz4 -- --quick

Benchmarking an external corpus

BENCH_STREAMLY_LZ4_FILE is looked up for the file to benchmark. It should contain the absolute path of the target file.

BENCH_STREAMLY_LZ4_STRATEGY is looked up for which benchmarking combinator to run on the target file.

It can take values of the following forms:

  • c+speed+bufsize
  • d+bufsize
  • r+bufsize

Here c corresponds to compression, d to decompression, and r to resizing.

Example,

$ export BENCH_STREAMLY_LZ4_FILE="path/to/file/"
$ export BENCH_STREAMLY_LZ4_STRATEGY="c+400+640000"
$ cabal bench

The commands above run the compression bench suite on path/to/file/ with an acceleration value of 400, reading the file as arrays of 640000 bytes each.

Note: For the decompression and resizing bench suites, a compressed file is expected as input.

streamly-lz4's People

Contributors

adithyaov, harendra-kumar

streamly-lz4's Issues

Improvements to benchmarks

  1. There are too many benchmarks; it is difficult to analyze the results because of the noise. We should keep only those that give us useful information:
    a. Remove the compress.decompress benchmarks. The usual use case is to either compress a stream or decompress a stream, so it should be enough to measure these two separately.
    b. We can keep one decompressResized benchmark to compare how much difference resizing makes; we do not need these variants of all benchmarks. Most of the time we would be resizing anyway.
    c. We can run one benchmark with multiple acceleration values to compare how much difference acceleration makes. We should not run all benchmark variants with all acceleration values.
  2. A 64K block size would probably be optimal because the dictionary is at most 64K. We can use a small block size (64 bytes), 64K, and a big block size (1 MB) for comparison. We do not need to run all benchmarks with all buffer sizes; running one benchmark with all buffer sizes is enough.
  3. Make all the input files of equal size so that we can compare the benchmarks. We can replicate the data to make them all 5MB or 10MB.
  4. Add a way to benchmark external files, like we have in streamly FileSystem.Handle benchmarks.

Provide support to read/write lz4 frames

See this document for the LZ4 frame format: https://github.com/lz4/lz4/blob/dev/doc/lz4_Frame_format.md .

We can write a parser (a terminating fold) to decode a stream of frames into a stream of compressed blocks: decode the frame header and then decompress appropriately using the options/flags it carries. Similarly, we can write a serializer to emit a stream of compressed blocks as an LZ4 frame using options provided by the user. A sketch of decoding the frame header is shown after the list below.

Also, we have to handle the following in the block headers:

  • the uncompressed bit in the block header. Currently we error out in that case.
  • blocks with a block checksum following the block data
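
As a reference point, the fixed part of the frame header can be decoded with plain list processing. The field layout below is taken from the frame format document; the names and types are only an illustrative sketch, not the library's code:

import Data.Bits (shiftR, testBit)
import Data.List (stripPrefix)
import Data.Word (Word8)

-- LZ4 frame magic number 0x184D2204, stored little-endian.
frameMagic :: [Word8]
frameMagic = [0x04, 0x22, 0x4D, 0x18]

data FrameFlags = FrameFlags
    { blockIndependence :: Bool  -- FLG bit 5
    , blockChecksum     :: Bool  -- FLG bit 4
    , hasContentSize    :: Bool  -- FLG bit 3
    , contentChecksum   :: Bool  -- FLG bit 2
    , hasDictId         :: Bool  -- FLG bit 0
    } deriving Show

-- Check the magic number and decode the FLG byte; the remaining input starts
-- at the BD byte (optional content size, dictionary id and header checksum follow).
decodeFrameHeader :: [Word8] -> Maybe (FrameFlags, [Word8])
decodeFrameHeader input = do
    rest <- stripPrefix frameMagic input
    case rest of
        flg : rest' | flg `shiftR` 6 == 1 ->  -- version bits (7-6) must be 01
            Just ( FrameFlags (testBit flg 5) (testBit flg 4)
                              (testBit flg 3) (testBit flg 2) (testBit flg 0)
                 , rest' )
        _ -> Nothing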

Specialized INLINE pragmas

In streamly, we use the following CPP helpers,

#define INLINE_EARLY  INLINE [2]
#define INLINE_NORMAL INLINE [1]
#define INLINE_LATE   INLINE [0]

This does not work well with a .hsc file, and it might not even be the right way to define these helpers in a .hsc file.
The closest facility I can see in the documentation is #let.

This code is currently commented out but we should figure out a way to write informative INLINE pragmas.
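
One possible (untested) shape for this using #let, which defines a macro whose body is printf-style output that hsc2hs splices into the generated Haskell source:

#let inline_early  f = "{-# INLINE [2] %s #-}", #f
#let inline_normal f = "{-# INLINE [1] %s #-}", #f
#let inline_late   f = "{-# INLINE [0] %s #-}", #f

-- used in the .hsc file as, for example:
#{inline_normal compressChunk}

This is only a guess at the syntax; it needs to be tried out against the actual .hsc sources.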

Implementation plan

  • Initialize the project
  • Update the license in LICENSE
  • Create a README.md with the required info
    • Use place holders for Installation and Benchmarks
  • Create a Changelog
  • Create scaffolding for tests
  • Create scaffolding for benchmarks
  • Write the API with undefined implementation
  • Add minimal documentation to the API
  • Import any C code required and update the credits accordingly
  • Implement the API
  • Write basic tests for the API
  • Write real world tests for the API
  • Write real world benchmarks for the API
  • Check for passing tests
  • [ ] Compare benchmarks with native implementation
    • Performance related tasks can be done later
  • Check for proper naming and module structure
  • Clean the code
  • Add some implementation notes
  • Update the README.md
  • Add CI
  • Make any final changes
  • Clean the commits
  • Get the code reviewed

Use specific type cast for all uses of fromIntegral

For example, like this:

              else if ((fromIntegral :: Int32 -> Int) srcLen < Array.byteLength arr)

It is easier to verify the safety, and it forces us to think about whether the cast is safe. Always try to upcast a smaller type to a larger one rather than the other way around.

Possibility of leaking `Ptr C_LZ4Stream` ?

https://github.com/composewell/streamly-lz4/blob/a929c20fb582da95783f84a48dc174204cb8601d/src/Streamly/Internal/LZ4.hs#L394C56-L394C56

In the code above, Ptr C_LZ4Stream would be freed ONLY after the following path (in reverse order):

  • c_freeStream is called
  • the stream enters the CompressDone ctx state
  • Stream.Stop is received from the inner stream (the only way to produce CompressDone ctx)

Does that mean we'll get a memory leak in code like:

readFile f & compressChunks cfg speed & take 0 & fold drain :: IO ()

since we'll never get Stream.Stop from readFile? Or am I missing another path that would clean up the Ptr C_LZ4Stream properly?

Use specific endianness for block headers

We should use a specific byte ordering instead of the machine byte ordering; otherwise the code will not work across machines with different byte orderings. We can raise an issue for this and address it separately.
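
For example, the 32-bit compressed-block length could be written and read explicitly as little-endian, independent of the host byte order (illustrative helpers, not the library's current code):

import Data.Bits (shiftL, shiftR, (.|.))
import Data.Word (Word8, Word32)

-- Serialize a 32-bit block length as 4 bytes, least significant byte first.
encodeLE32 :: Word32 -> [Word8]
encodeLE32 w = [fromIntegral (w `shiftR` s) | s <- [0, 8, 16, 24]]

-- Read the 4 bytes back in the same order on any machine.
decodeLE32 :: Word8 -> Word8 -> Word8 -> Word8 -> Word32
decodeLE32 b0 b1 b2 b3 =
      fromIntegral b0
  .|. fromIntegral b1 `shiftL` 8
  .|. fromIntegral b2 `shiftL` 16
  .|. fromIntegral b3 `shiftL` 24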

Unreproducible GHC 8.8.4 error

The CI for GHC 8.8.4 on Linux failed once. Unfortunately, the logs were not recorded and I don't know the right conditions to reproduce the error.

It was probably a memory issue but I'm not sure.

Benchmark names are too long

We should not use the whole absolute path of the corpus file in the benchmark name.

compress/files/bufsize(65536)/compress 5//home/harendra/composewell/streamly-lz4/corpora/large/bible.txt.normalized time                 41.77 ms

compress/files/bufsize(65536)/compress 5//home/harendra/composewell/streamly-lz4/corpora/large/world192.txt.normalized time                 33.97 ms

compress/files/bufsize(65536)/compress 5//home/harendra/composewell/streamly-lz4/corpora/cantrbry/alice29.txt.normalized time                 39.86 ms

Write compressChunk and decompressChunk as folds

compressChunk :: Int -> Fold m (Array.Array Word8) (Array.Array Word8)

-- Accept properly resized chunks
decompressChunk :: Fold m (Array.Array Word8) (Array.Array Word8)

We can then implement compress and decompress in terms of these folds, as sketched below.
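
Assuming each fold terminates after consuming one chunk, the stream-level combinators could be a simple foldMany over them (a sketch, not the actual implementation):

-- Repeatedly run the terminating fold over the input stream of chunks,
-- emitting one output array per fold run.
compress :: MonadIO m => Int -> SerialT m (Array.Array Word8) -> SerialT m (Array.Array Word8)
compress speed = Stream.foldMany (compressChunk speed)

decompress :: MonadIO m => SerialT m (Array.Array Word8) -> SerialT m (Array.Array Word8)
decompress = Stream.foldMany decompressChunk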

Proposed compress and decompress APIs

Compress:

  • compress: A convenience API that uses a default speedup factor and a default block size for compression
  • compressWith: ability to specify speedup and block size
  • compressRaw: compress the arrays in the input stream as is without resizing

Decompress:

  • decompress: resize the input arrays and decompress
  • decompressRaw: assume the arrays are of the right size already
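
In type-signature form the proposal would look roughly like this (the speedup factor and block size are shown as plain Ints; the actual representation is open):

compress     :: MonadIO m => SerialT m (Array Word8) -> SerialT m (Array Word8)
compressWith :: MonadIO m => Int -> Int -> SerialT m (Array Word8) -> SerialT m (Array Word8)
                -- speedup factor and block size
compressRaw  :: MonadIO m => Int -> SerialT m (Array Word8) -> SerialT m (Array Word8)

decompress    :: MonadIO m => SerialT m (Array Word8) -> SerialT m (Array Word8)
decompressRaw :: MonadIO m => SerialT m (Array Word8) -> SerialT m (Array Word8)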

Rename the APIs

The compress/decompress APIs work on chunks/arrays. All the APIs in streamly that work on arrays are suffixed with Chunks. We should use the same convention here as well:

  • compressChunks: current compress
  • decompressChunks: current decompress
  • compress: compress a stream of Word8 to a stream of Word8
  • decompress: decompress a stream of Word8 consisting of compressed chunks

Try stream to stream compression/decompression APIs

For decompression, we could flatten the compressed input arrays into a stream and parse the stream into compressed chunks using a terminating fold. Check how this performs in comparison to array resizing.

For compression we could chunk the stream into arrays of 64K by default and provide an option to change the default.
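
A sketch of the compression side, assuming a chunk-level combinator (compressChunks here) that takes an acceleration value, with the 64K default exposed as a parameter:

-- Group the byte stream into arrays of at most blockSize bytes and compress
-- each array; blockSize would default to 64K (65536).
compressStream :: MonadIO m => Int -> Int -> SerialT m Word8 -> SerialT m (Array Word8)
compressStream blockSize speed =
    compressChunks speed . Stream.chunksOf blockSize Array.write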

We can provide both array as well as stream based APIs if arrays show some perf advantage otherwise we can keep just stream based ones.

Issue with fusion

All combinators fuse if they are the only combinators in the pipeline.
Adding any other combinator breaks fusion.

The problem seems to be with stuff not getting inlined.
