Compression Standard
Home Page: https://compression.spec.whatwg.org/
License: Other
From a usability perspective, a blocking API is very attractive. However, the impact on interactivity would be unacceptable on the main thread.
An interesting possibility is to provide a blocking API only in workers.
Maybe something like
[Exposed=Worker] static Uint8Array blockingCompress(DOMString format, BufferSource input);
See also #8 for an equivalent non-blocking method and #37 for motivation.
I have a suspicion that using text/event-stream with CompressionStream could cause delays in delivering events to consumers, but I'm not sure whether that's true or not. It seems to me that after enqueuing some event(s)/chunk(s), the compression stream could be in a waiting state for more bytes to compress before it enqueues the corresponding compressed chunk(s).
Is some flush() method needed in order to reliably send a compressed stream of events and to guarantee that individual events are not delayed?
https://wicg.github.io/compression/#decompress-and-enqueue-a-chunk
Per the spec it's a must to complete the conversion first and then do the enqueue, but would it be bad to enqueue as soon as each output buffer is filled?
Not that there's any important reason to do that, just curious. Maybe enqueuing all at once (as the spec says) ensures more consistent behavior among implementations, as implementations with a smaller buffer might enqueue some chunks before erroring while others with a larger buffer might enqueue nothing.
It could be:
function compressArrayBuffer(input) {
  // Blob takes a sequence of parts, so the buffer must be wrapped in an array.
  const stream = new Blob([input])
    .stream()
    .pipeThrough(new CompressionStream('deflate'));
  return new Response(stream).arrayBuffer();
}
I guess it's a little hacky, but the following example already uses blob/response, so maybe it's ok?
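A matching decompression helper can use the same Blob/Response trick. This is a sketch, assuming an environment (browser or Node 18+) where Blob, Response, and DecompressionStream are globals; the function name is mine, not from the spec:

```javascript
// Hypothetical helper (not part of the spec): decompress a buffer using
// the same Blob/Response shortcut as the compression example above.
async function decompressArrayBuffer(input, format = 'deflate') {
  const stream = new Blob([input])           // wrap the buffer in a Blob part
    .stream()                                // ReadableStream of the bytes
    .pipeThrough(new DecompressionStream(format));
  return new Response(stream).arrayBuffer(); // collect into one ArrayBuffer
}
```

Like the compression version, this buffers the entire result, so it only suits data that fits in memory.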
I have a web app which consumes gzipped data. On large datasets, the decompressed size can be a gigabyte or more. I would like to minimize memory usage by never materializing the fully decompressed buffer into memory; instead, I would like to decompress in chunks and then process the decompressed chunks.
It seems that I cannot use DecompressionStream to achieve this at the moment. As specced, when the compressed data comes from the network, the browser has to decompress it all and enqueue the decompressed chunks.
I would prefer to have the browser only hold on to the compressed memory until I ask for the next decompressed chunk.
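As an illustration of the shape of the consumer side, here is a sketch (my code, not a spec example) that processes decompressed chunks one at a time with a reader instead of buffering them all; the hope in this issue is that the browser would only pull compressed bytes as the reader asks for more output:

```javascript
// Sketch: consume a gzip stream chunk-by-chunk so the full decompressed
// payload never has to exist in memory at once.
async function processGzipChunks(compressedStream, onChunk) {
  const reader = compressedStream
    .pipeThrough(new DecompressionStream('gzip'))
    .getReader();
  let total = 0;
  for (;;) {
    const { value, done } = await reader.read();
    if (done) break;
    total += value.byteLength; // value is a Uint8Array chunk
    await onChunk(value);      // process and drop the chunk
  }
  return total; // total decompressed bytes seen
}
```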
For applications such as reading/writing zip files it's necessary to have access to the raw deflate stream without any headers or footers. We should add a "deflate-raw" format to support this.
See previous discussion at ricea/compressstream-explainer#8.
Currently the compression formats we support are only mentioned non-normatively in the Introduction, and then explained in Terminology section. Probably there needs to be a section with a table of supported formats and relevant standards.
I guess the bikeshed metadata needs some changes?
It would be good to explicitly match the format names in https://www.iana.org/assignments/http-parameters/http-parameters.xhtml#content-coding. Currently we implicitly match them, so it's not a big change.
It's unclear what to do about possible future formats that aren't used for HTTP.
Following up on the discussion in #43 (comment), the current wording of how to handle deflate-raw data does not state how to handle data after the end of a raw deflate stream (i.e. blocks or arbitrary data following a block with the BFINAL flag set).
The requirement that non-conforming blocks must be treated as errors could be interpreted as not allowing any trailing data, which would match the requirements for gzip and non-raw deflate as well as what zlib is doing, but this is not obvious from the specification.
Looking at the spec, it seems like these transforms create regular readable streams. Should they be readable byte streams?
The gzip spec includes support for one or more members ("A gzip file consists of a series of "members" (compressed data sets)."), but this spec currently states "A gzip stream may only contain one "member"."
This is a request to support decompressing multi-member gzip files, for use cases that depend on multi-member gzip (such as parsing ISO WARC).
Alternatively, a way to get the unused data as in #39 would allow the developer to implement this manually, though the ideal solution for my use case would be native support for a multi-member DecompressionStream.
Could provide a proposed spec if there would be interest.
https://wicg.github.io/compression/#decompress-flush-and-enqueue
- If the end of the compressed input has not been reached, then throw a TypeError.
Not that it matters too much, but I'm not sure a TypeError makes much sense there.
I discovered that the methods of WritableStreamDefaultWriter (at least when used with CompressionStream and DecompressionStream) don't resolve the same way across various implementations. I was wondering if there is a defined standard behavior as to when they should return? My discoveries for this error were tracked in this issue here.
As mentioned in the WICG draft report, you can use the Compression Streams API with an ArrayBuffer object by using a WritableStreamDefaultWriter, making use of the writer.write() and writer.close() methods to chunk in parts of the ArrayBuffer.
These two methods each return Promises, however, so I thought it would make sense to use them with await to ensure that any errors from their calls could bubble back up to the main async function implementation. This doesn't appear to work in all platform implementations, though. Adding these await calls works correctly in Node.js' implementation, but not in Chrome, Safari, nor Firefox's implementations.
Using writer.write() and writer.close() with await will never resolve in all three browsers, while they resolve properly and in a timely fashion with undefined in Node.js. Is this because Node.js has a stream implementation unique from that of the browsers? I don't see why these never resolve in the browsers.
The interesting part is that they all do throw errors (where applicable, say if the data was decompressed using an incompatible decompression format that doesn't match the format the ArrayBuffer uses).
So without being able to use await calls on these methods, you have to manually catch the errors with your own .catch() handlers, which feels similar to what you have to do when using for await loops, which Jake Archibald covered in an article on his blog.
I mainly discovered this in my own project, where I am using the Compression Streams API to discern the compression format of a given ArrayBuffer file, by the use of nested try-catch statements to find out what the file was or wasn't compressed with (example code below, similar to that of my project):
declare type CompressionFormat = "deflate" | "deflate-raw" | "gzip";

// This is where the `WritableStreamDefaultWriter` implementation is located.
declare function decompress(data: Uint8Array, format: CompressionFormat): Promise<Uint8Array>;

export type Compression = CompressionFormat | null;

export interface ReadOptions {
  compression?: Compression;
}

export async function read(data: Uint8Array, { compression }: ReadOptions = {}){
  // Snippet of the function body; incomplete
  if (compression === undefined){
    try {
      return await read(data, { compression: null });
    } catch (error){
      try {
        return await read(data, { compression: "gzip" });
      } catch {
        try {
          return await read(data, { compression: "deflate" });
        } catch {
          try {
            return await read(data, { compression: "deflate-raw" });
          } catch {
            throw error;
          }
        }
      }
    }
  }

  // From this point in the function body, `compression` can be anything except `undefined`.
  compression;

  if (compression !== null){
    data = await decompress(data, compression);
  }

  // `data` is now uncompressed; no compression format had to be known ahead of time.
  data;
}
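A minimal sketch of the decompress helper declared above (my implementation, not the project's): it pushes the whole buffer through a DecompressionStream via a writer, deliberately not awaiting write()/close() because of the browser behavior discussed, and lets any failure surface through the collected result instead:

```javascript
// Hypothetical implementation of the `decompress` helper declared above.
// write()/close() are intentionally not awaited; a failure (e.g. wrong
// format) errors the readable side and rejects arrayBuffer() instead.
async function decompress(data, format) {
  const ds = new DecompressionStream(format);
  const writer = ds.writable.getWriter();
  writer.write(data).catch(() => {});  // swallow; the error surfaces below
  writer.close().catch(() => {});
  return new Uint8Array(await new Response(ds.readable).arrayBuffer());
}
```

With this shape, a wrong-format attempt rejects the returned promise, which is exactly what the nested try-catch sniffing above relies on.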
There should be a way to register algorithms that are not built-in but can be used to construct CompressionStream and DecompressionStreams in the same realm.
@domenic provided the following API sketch:
dictionary TransformStreamInit {
  required ReadableStream readable;
  required WritableStream writable;
};

callback CompressionAlgorithm = Promise<TransformStreamInit> (DOMString format, object options);

interface CompressionAlgorithms {
  maplike<DOMString, CompressionAlgorithm>;
};

partial interface CompressionStream {
  static CompressionAlgorithms algorithms;
};

CompressionStream.algorithms.set(
  "foo",
  () => ({ readable: ..., writable: ... })
);
CompressionStream.algorithms.has('foo');
CompressionStream.algorithms.keys();
Internally, the set of available algorithms would be stored in an internal slot on the global object.
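A userland approximation of this registry can be sketched today with a Map and a factory function (the names algorithms and createCompressionStream are illustrative, not proposed spec names):

```javascript
// Sketch: a per-realm registry of custom algorithms, falling back to the
// built-in CompressionStream for formats that aren't registered.
const algorithms = new Map();

function createCompressionStream(format, options = {}) {
  const factory = algorithms.get(format);
  if (factory) return factory(format, options); // must yield { readable, writable }
  return new CompressionStream(format);         // built-in fallback
}

// Register a trivial "identity" algorithm backed by a pass-through
// TransformStream (a real registration would wire up actual codec logic).
algorithms.set('identity', () => new TransformStream());
```

Piping through createCompressionStream('identity') returns the input unchanged, while 'gzip' still reaches the native implementation.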
I'm not very familiar with streams and compression, but hopefully this is understandable.
For deflate, the spec states "It is an error if there is additional input data after the ADLER32 checksum." For gzip, the spec says "It is an error if there is additional input data after the end of the "member"."
As expected, Chrome's current implementation throws a TypeError ("Junk found after end of compressed data.") when extra data is written to a DecompressionStream.
This error can be caught and ignored, but there doesn't seem to be a way of retrieving the already-written-but-not-used "junk" data. There seems to be an assumption here that developers already know the length of the compressed data, and can provide exactly that data and nothing more. On the contrary, this "junk" data can be very important in cases where the compressed data is embedded in another stream and you don't know the length of the compressed data.
A good example of this is Git's PackFile format, which only tells you the size of the uncompressed data, not the compressed size. In such a case you must rely on the decompressor to tell you when it's done decompressing data, and then handle the remaining data.
My attempt at putting together an example:
// A stream with two compressed items
// deflate("Hello World") + deflate("FooBarBaz")
const data = new Uint8Array([
0x78, 0x9c, 0xf3, 0x48, 0xcd, 0xc9, 0xc9, 0x57, 0x08, 0xcf, 0x2f, 0xca, 0x49, 0x01, 0x00, 0x18, 0x0b, 0x04, 0x1d,
0x78, 0x9c, 0x73, 0xcb, 0xcf, 0x77, 0x4a, 0x2c, 0x72, 0x4a, 0xac, 0x02, 0x00, 0x10, 0x3b, 0x03, 0x57,
]);
// Decompress the first item
const item1Stream = new DecompressionStream('deflate');
item1Stream.writable.getWriter().write(data).catch(() => { /* Rejects with a TypeError: Junk found after end of compressed data. */ });
console.log(await item1Stream.readable.getReader().read()); // "Hello World"
// How do I get the remaining data (the "junk") in order to decompress the second item?
// I've already written it to the previous stream, and there's nothing to tell me how much was used or what's left over.
const item2Stream = new DecompressionStream('deflate');
item2Stream.writable.getWriter().write(getRemainingDataSomehow());
console.log(await item2Stream.readable.getReader().read()); // "FooBarBaz"
Now, as a workaround, I could write the data to my first stream one byte at a time, saving the most recently written byte and carrying it over when the writer throws that specific exception - but writing one byte at a time feels very inefficient and adds a lot of complexity, and checking for that specific error message seems fragile (it might change, and other implementations might use a different message).
Zlib itself provides a way to know which bytes weren't used (though I don't know the details of how).
Python's zlib API provides an unused_data property that contains the unused bytes.
Node's zlib API provides a bytesWritten property that can be used to calculate the unused data.
It would be great to have something similar available in the DecompressionStream api.
Maybe we should have something like
static Promise<Uint8Array> compress(DOMString format, BufferSource input);
in CompressionStream, and a similar API for DecompressionStream, to make one-shot compression and decompression easier to do.
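Until something like that exists, one-shot helpers are easy to sketch in userland. These mirror the shape of the proposal above but are ordinary functions, not statics on CompressionStream:

```javascript
// Userland one-shot helpers approximating the proposed static methods.
async function compress(format, input) {
  const stream = new Blob([input]).stream()
    .pipeThrough(new CompressionStream(format));
  return new Uint8Array(await new Response(stream).arrayBuffer());
}

async function decompress(format, input) {
  const stream = new Blob([input]).stream()
    .pipeThrough(new DecompressionStream(format));
  return new Uint8Array(await new Response(stream).arrayBuffer());
}
```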
Even for built-in algorithms (gzip and deflate), there are various parameters that users can supply which can unlock some use cases. Examples:
Will this become part of Streams or do we create a new "top-level" compression standard somewhere?
There should be no need for as much boilerplate as there is now, nor a need to reference RFC 2119.
Hey,
I was just pointed to this spec by Kai Ninomiya. At Unity3D we notice time and time again that it would be great to have built-in, browser-provided support for compressing and decompressing. Our wish list:
Some of the use cases that we have had:
Content-Encoding: gzip (or br). This way the server's on-demand compression is not needed, avoiding taxing the server CPU and its compression caches, and users get a compressed file over the wire.
However, it is very common for Unity3D game developers, when it comes time to host a game, to not have access to configuring web servers arbitrarily. Whether due to not being web admins or not having the knowledge, they are unable to add the Content-Encoding header. In that case, Unity downloads gzipped asset files directly into the engine.
We decompress the file in Wasm, but experience shows that this is slower than what the native browser side could do, and we have lost streamed decompression due to not having an appropriate streaming decompressor in place (this could be implemented, but we never quite had the cycles to do so).
If the browser exposed its own decompressors, then we could just feed the file over to those, and not have to worry about a performance loss, even if the user was not able to set up the header fields correctly.
For compression, though, Tiny Unity does not have a solution, and hence we compile in zlib. It is not the largest in size, but it would be nice to be able to remove it - and in particular to have the browser provide a brotli codec for the page to use. Shipping one codec with the engine is tough, but having to ship multiple is even tougher, which adds friction whenever we attempt to update.
In this kind of setup, it would be very useful to be able to synchronously run a compression or a decompression activity in a Worker. This way the Worker could retain its callstack context in connection with the Wasm SAB work queue, rather than having to structure the programs to yield to the Worker's event loop in order to be able to process other functions. Yielding has been observed to worsen latency in real-time rendering applications, which target interactive frame rates.
Synchronous compression and decompression is naturally not expected to work on the main browser thread.
So I am curious what the latest status of this spec is, and I wonder how the above use cases would mesh with the existing spec?
Thanks!
There are many ways compressed input can be invalid, including at least
The specification doesn't currently specify which, if any, of these should error the stream.
Probably they all should, although we may want options to disable some kinds of errors to make it easier to handle badly-formed input.
See also ricea/compressstream-explainer#9.
Currently it accepts a string, but given that the constructors throw on unsupported formats, maybe it could be an enum?
It would be good to support all the compression, or at least decompression, schemes supported by servers and browsers. This looks like a great step forward to not having to write compression and decompression in JavaScript.
My use-case would be to download a bunch of brotli files in an archive (somewhat like tar), split them into individual files (unarchive, but don't decompress yet) in the service-worker cache, and return them to the browser as called for.
Currently the standard says
Let buffer be the result of compressing chunk with cs's format.
However, we also need to pass the current state into the compression algorithm. Probably we should create an empty state in the constructor and then use it every time we compress a buffer.
The "deflate" format supports preset dictionaries. These permit backreferences to be used from the start of the data to refer to items in the dictionary as if it was prepended to the uncompressed data. This can give significant improvements in compression ratio, particularly for small inputs. See FDICT in RFC1950. This is also a common feature in other compression formats.
This should be supported by CompressionStream and DecompressionStream.
For CompressionStream, an obvious API would be
const cs = new CompressionStream("deflate", { dictionary: aBufferSourceObject });
An open question is whether it is necessary to be able to pass multiple dictionaries to DecompressionStream (keyed by the Adler32 checksum), or whether just passing a single dictionary is sufficient. If we only support passing a single dictionary, this requires the calling code to either know by some out-of-band method what dictionary is in use, or parse the Adler32 checksum out of the header itself to choose the right dictionary.
In terms of adding additional compression algorithms, I'll take a stand and cry out for Google's Snappy (sz). It's what I've been using lately, regardless of how popular or old it is. While everyone cries for brotli, I'll leave this here.
Sorry - I've spent a lot of time tonight wrestling with the thought of hacking together a solution to support custom algorithms in a client browser. I'm wrapping up development of a production-ready back-end for all of my personal projects to come, and I was easily able to encode my large data files, as well as network requests, with it. Which led me down this rabbit hole; the least I could do is open an issue or leave a comment to further nudge the development of the web!
I'm actually building my own client, so while this sounds entirely specific, it will be thanks to Wails. So whenever WebKitGTK becomes one of the first browsers to support anything better or more unique than what's commonly found, I'll be happy!
Can't believe this was one of the only people crying out for it as well, 10 whole years ago.
Thank you for coming to my TED talk; just wanted to annoy you all out of love.
compression.spec.whatwg.org is not among the domain names associated with the certificates served on https://compression.spec.whatwg.org/
Issue for discussion of adding support for the zstd format to the API.
CompressionStream.prototype.format should be a getter which returns the format the stream was created with. Similarly for DecompressionStream.prototype.format.