Comments (6)
This comment says there's documentation somewhere on how to write a simple inflate/deflate function: #25 (comment)
I think just linking to where that is from this section of the readme would help a lot: https://github.com/mirage/decompress#higher-interface
I can understand the difficulty of making tradeoffs here, but I think it's generally fine to just pick something "good enough" for a high-level interface and explain in the documentation/comments why people might want to use a lower-level version. I can understand leaving that for a higher-level library, though, although you might want to mention in the docs that people looking for a higher-level interface could try [insert library here].
Note: Based on what you said above, I ended up making my high-level interface look like:
```ocaml
let inflate_bigstring_to_bigbuffer
    ?(buffer_size_bytes = 10000)
    ?(expected_compression_ratio = 3)
    input
  =
  let buffer = Bigstring.create buffer_size_bytes
  and output = Bigbuffer.create (Bigstring.length input * expected_compression_ratio) in
  let refill s = Bigstring.length s
  and flush s n = Bigbuffer.add_bigstring output (Bigstring.sub_shared s ~pos:0 ~len:n) in
  Gz.Higher.uncompress ~refill ~flush input buffer |> Result.map ~f:(fun _ -> output)
;;

let inflate_bigstring ?buffer_size_bytes ?expected_compression_ratio input =
  inflate_bigstring_to_bigbuffer ?buffer_size_bytes ?expected_compression_ratio input
  |> Result.map ~f:Bigbuffer.big_contents
;;

let inflate_string ?buffer_size_bytes ?expected_compression_ratio input =
  Bigstring.of_string input
  |> inflate_bigstring_to_bigbuffer ?buffer_size_bytes ?expected_compression_ratio
  |> Result.map ~f:Bigbuffer.contents
;;
```
I think the biggest downside of this is that it uses Bigbuffer from Core_kernel, but that lets it be fairly efficient except for a copy at the end to get the contents.
from decompress.
Not sure if you want to go there, but it would also be useful to have an interface for inflating an async input, like:

```ocaml
val inflate_pipe : string Async.Pipe.t -> string Async.Pipe.t
```
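As an illustration of what such a wrapper might look like (a sketch only, and not truly streaming: it buffers the entire pipe before inflating; it reuses the `inflate_string` defined above and assumes Async's `Pipe.create_reader`/`Pipe.to_list` API, with `Pipe.Reader.t` in place of the abbreviated `Pipe.t` in the signature):

```ocaml
(* Sketch: wrap inflate_string (defined above) around an Async pipe.
   Not streaming -- it collects the whole pipe before inflating, so a
   truly incremental version would need decompress's low-level API. *)
let inflate_pipe (input : string Async.Pipe.Reader.t) : string Async.Pipe.Reader.t =
  Async.Pipe.create_reader ~close_on_exception:true (fun writer ->
    let open Async.Deferred.Let_syntax in
    (* Drain the input pipe into one string. *)
    let%bind chunks = Async.Pipe.to_list input in
    match inflate_string (String.concat chunks) with
    | Ok s -> Async.Pipe.write writer s
    | Error _ -> Async.Deferred.return ())
```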
Hi, thanks for trying decompress!
This question about an even simpler interface is well known (see #25). I agree with you that the documentation is not the best, and we definitely need to show a simple example like `to_string`/`of_string`, as you said.
However, I would not like to provide such a function in the package, because of the responsibilities it implies. The memory/ratio/speed trade-off exists for any kind of input, and for me it's important to let the end user control such details.
- For example, an `inflate_string : string -> string` needs to arbitrarily choose the length of the output buffer, but we don't have any clue about that length. We could say that the usual compression ratio is 1/3 and allocate a buffer such as `Bigstringaf.create (String.length input * 3)`, but that's not always true: sometimes the needed output buffer is smaller than the input, and sometimes it is much larger (a ratio of 1/5).
- Such a situation can be handled with a `Queue`, as you did, or a simple `Buffer`. We could use a rope too. In any case, several choices appear and none of them is perfect. A queue fits best for a fully unknowable input; a buffer can take advantage of a hypothetical length; a rope has the same advantage as the queue. However, with most of them we lose the advantage of the underlying `Bigarray` and must copy between `Bigarray` and `string` (unless the data structure is abstract enough to manipulate a `Bigarray` directly).
- Again, the question of `string` vs. `Bigarray` is not clear-cut. We know that in the context of inflation/deflation it's fair to use `Bigarray`, but not necessarily for upper dependencies. In some contexts, for small objects, it's better to manipulate `string` and reuse an underlying `Bigarray` (due to the cost of allocation). In other contexts, it's better to keep the `Bigarray`, to be able to do further processing on it (such as computing a hash) and to take advantage of its non-relocatable nature.
- Finally, maybe you want to print out the output buffer. In that context, it's definitely better to allocate a large output buffer (independently of the length of the input) and print it out via a syscall.
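To make the `Buffer` option from this list concrete, here is a sketch of an `inflate_string` that never guesses the output length, at the cost of the `Bigarray`-to-`string` copy described above. It assumes the same `Gz.Higher.uncompress` usage as the snippet earlier in this thread, with `Bigstringaf` instead of Core's `Bigstring`:

```ocaml
(* Sketch only: a Buffer-backed inflate that avoids guessing a
   compression ratio, mirroring the Gz.Higher.uncompress usage above. *)
let inflate_string_with_buffer input =
  let buffer = Bigstringaf.create 0x1000 in
  (* Buffer grows on demand, so no a-priori output size is needed. *)
  let output = Buffer.create (String.length input) in
  let refill s = Bigstringaf.length s in
  let flush s n =
    (* This substring is the Bigarray/string copy the list above
       warns about: one copy per flushed chunk. *)
    Buffer.add_string output (Bigstringaf.substring s ~off:0 ~len:n)
  in
  let i = Bigstringaf.of_string input ~off:0 ~len:(String.length input) in
  Gz.Higher.uncompress ~refill ~flush i buffer
  |> Result.map (fun _metadata -> Buffer.contents output)
```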
So several questions arise, and no single answer fits every use case. If we provide such a function, we take responsibility for its implementation, and, as I explained, that implementation requires several choices. We know that these choices can be really bad for some use cases, so to avoid such problems we prefer to respect, at least, the interface provided by `camlzip`.
The current `Higher` module gives full control over these details, and even if most of them could be replaced by default values, we don't want to push the user toward a naive use of decompress (and this is where we should provide better documentation!). A one-shot function implies several hidden choices; we don't want to hide the magic too much, and we don't want to take responsibility for them.
Another aspect is time: decompress is a full re-implementation of `zlib` (and `gzip`), so we are more focused on algorithms/speed/memory consumption than on a simpler interface. For that purpose, you can take a look at `ezgzip`, which experiments with decompress. I think it's a better place if you want a simpler API.
And about `async`: decompress is a bedrock project for some other MirageOS-compatible projects, so we have strict constraints on dependencies (no `async`, no `lwt`, no `unix`).
I will try to improve the documentation next week, and if you want to participate, I will be happy to incorporate your contributions 👍.
> `and flush s n = Bigbuffer.add_bigstring output (Bigstring.sub_shared s ~pos:0 ~len:n) in`
I'm not sure about `sub_shared`. Concretely, `s` is `buffer`. decompress never allocates temporary buffers, so you should copy the bytes from `s`/`buffer` into `output`. I don't know the semantics of `Bigbuffer.add_bigstring` (whether it copies or not), but if it reuses `s`, the output will be wrong.
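If `Bigbuffer.add_bigstring` did alias its input, a defensive `flush` could force the copy explicitly before appending (a sketch, using Core's `Bigstring.to_string`):

```ocaml
(* Defensive variant of flush: materialize the first n bytes of the
   reused buffer s as a fresh string, then append that copy. *)
let flush s n = Bigbuffer.add_string output (Bigstring.to_string s ~pos:0 ~len:n)
```

This costs one extra intermediate `string` allocation per chunk, which is exactly what `sub_shared` + a copying `add_bigstring` avoids.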
Since then, I have added documentation, added examples, and clarified some aspects of decompress; you can see it here: #116
`Bigbuffer.add_bigstring` copies the input, so I think what I was doing makes sense (I'm letting that function copy the substring without doing an additional copy myself).
Thanks for adding more documentation to this!
Thanks for the feedback 👍.