Compression Standard
Home Page: https://compression.spec.whatwg.org/
License: Other
From a usability perspective, a blocking API is very attractive. However, the impact on interactivity would be unacceptable on the main thread.
An interesting possibility is to provide a blocking API only in workers.
Maybe something like
[Exposed=Worker] static Uint8Array blockingCompress(DOMString format, BufferSource input);
See also #8 for an equivalent non-blocking method and #37 for motivation.
I have a suspicion that using text/event-stream with CompressionStream could cause delays in delivering events to consumers, but I'm not sure whether that's true or not. It seems to me that after enqueuing some event(s)/chunk(s), the compression stream could be in a waiting state for more bytes to compress before it enqueues the corresponding compressed chunk(s).
Is some flush() method needed in order to reliably send a compressed stream of events and to guarantee that individual events are not delayed?
https://wicg.github.io/compression/#decompress-and-enqueue-a-chunk
Per the spec it's a must to complete the conversion first and then do the enqueue, but would it be bad to enqueue as soon as each output buffer is filled?
Not that there's any important reason to do that, just curious. Maybe enqueuing all at once (as the spec says) ensures more consistent behavior among implementations, as implementations with a smaller buffer might enqueue some chunks before erroring while others with a larger buffer might enqueue nothing.
It could be:
function compressArrayBuffer(input) {
  // Blob takes a sequence of parts, so the buffer must be wrapped in an array.
  const stream = new Blob([input])
    .stream()
    .pipeThrough(new CompressionStream('deflate'));
  return new Response(stream).arrayBuffer();
}
I guess it's a little hacky, but the following example already uses blob/response, so maybe it's ok?
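A matching decompression helper can use the same Blob/Response trick. This is a sketch, assuming an environment (browser or Node 18+) where Blob, Response, and DecompressionStream are globals; the function name is mine, not from the spec:

```javascript
// Hypothetical helper (not part of the spec): decompress a buffer using
// the same Blob/Response shortcut as the compression example above.
async function decompressArrayBuffer(input, format = 'deflate') {
  const stream = new Blob([input])           // wrap the buffer in a Blob part
    .stream()                                // ReadableStream of the bytes
    .pipeThrough(new DecompressionStream(format));
  return new Response(stream).arrayBuffer(); // collect into one ArrayBuffer
}
```

Like the compression version, this buffers the entire result, so it only suits data that fits in memory.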
I have a web app which consumes gzipped data. On large datasets, the decompressed size can be a gigabyte or more. I would like to minimize memory usage by never materializing the fully decompressed buffer into memory; instead, I would like to decompress in chunks and then process the decompressed chunks.
It seems that I cannot use DecompressionStream to achieve this at the moment. As specced, when the compressed data comes from the network, the browser has to decompress it all and enqueue the decompressed chunks.
I would prefer to have the browser only hold on to the compressed memory until I ask for the next decompressed chunk.
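As an illustration of the shape of the consumer side, here is a sketch (my code, not a spec example) that processes decompressed chunks one at a time with a reader instead of buffering them all; the hope in this issue is that the browser would only pull compressed bytes as the reader asks for more output:

```javascript
// Sketch: consume a gzip stream chunk-by-chunk so the full decompressed
// payload never has to exist in memory at once.
async function processGzipChunks(compressedStream, onChunk) {
  const reader = compressedStream
    .pipeThrough(new DecompressionStream('gzip'))
    .getReader();
  let total = 0;
  for (;;) {
    const { value, done } = await reader.read();
    if (done) break;
    total += value.byteLength; // value is a Uint8Array chunk
    await onChunk(value);      // process and drop the chunk
  }
  return total; // total decompressed bytes seen
}
```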
For applications such as reading/writing zip files it's necessary to have access to the raw deflate stream without any headers or footers. We should add a "deflate-raw" format to support this.
See previous discussion at ricea/compressstream-explainer#8.
Currently the compression formats we support are only mentioned non-normatively in the Introduction, and then explained in Terminology section. Probably there needs to be a section with a table of supported formats and relevant standards.
I guess the bikeshed metadata needs some changes?
It would be good to explicitly match the format names in https://www.iana.org/assignments/http-parameters/http-parameters.xhtml#content-coding. Currently we implicitly match them, so it's not a big change.
It's unclear what to do about possible future formats that aren't used for HTTP.
Following up on the discussion in #43 (comment), the current wording of how to handle deflate-raw data does not state how to handle data after the end of a raw deflate stream (i.e. blocks or arbitrary data following a block with the BFINAL flag set).
The requirement that non-conforming blocks must be treated as errors could be interpreted as not allowing any trailing data, which would match the requirements for gzip and non-raw deflate as well as what zlib is doing, but this is not obvious from the specification.
Looking at the spec, it seems like these transforms create regular readable streams. Should they be readable byte streams?
The gzip spec includes support for one or more members ("A gzip file consists of a series of "members" (compressed data sets)."), but this spec currently states "A gzip stream may only contain one "member"."
This is a request to support decompressing multi-member gzip files, for use cases that depend on multi-member gzip (such as parsing ISO WARC).
Alternatively, a way to get the unused data as in #39 would allow the developer to implement this manually, though the ideal solution for my use case would be native support for a multi-member DecompressionStream.
Could provide a proposed spec if there would be interest.
https://wicg.github.io/compression/#decompress-flush-and-enqueue
- If the end of the compressed input has not been reached, then throw a TypeError.
Not that it matters too much, but I'm not sure a TypeError makes much sense there.
I discovered that the methods of WritableStreamDefaultWriter (at least when used with CompressionStream and DecompressionStream) don't resolve the same way across various implementations. I was wondering if there is a defined standard behavior as to when they should return? My discoveries for this error were tracked in this issue here.
As mentioned in the WICG draft report, you can use the Compression Streams API with an ArrayBuffer object by using a WritableStreamDefaultWriter, making use of the writer.write() and writer.close() methods to chunk in parts of the ArrayBuffer.
These two methods each return Promises, however, so I thought it would make sense to use them with await to ensure that any errors from their calls could bubble back up to the main async function implementation. This doesn't appear to work in all platform implementations, though. Adding these await calls works correctly in Node.js' implementation, but not in Chrome, Safari, nor Firefox's implementations.
Using writer.write() and writer.close() with await will never resolve in all three browsers, while they resolve properly and in a timely fashion with undefined in Node.js. Is this because Node.js has a stream implementation unique from that of the browsers? I don't see why these never resolve in the browsers.
The interesting part is that they all do throw errors (where applicable, say if the data was decompressed using an incompatible decompression format that doesn't match the format the ArrayBuffer uses).
So without being able to use await calls on these methods, you have to manually catch the errors with your own .catch() handlers, which feels similar to what you have to do when using for await loops, which Jake Archibald covered in an article on his blog.
I mainly discovered this in my own project, where I am using the Compression Streams API to discern the compression format of a given ArrayBuffer file, by the use of nested try-catch statements to find out what the file was or wasn't compressed with (example code below, similar to that of my project):
declare type CompressionFormat = "deflate" | "deflate-raw" | "gzip";

// This is where the `WritableStreamDefaultWriter` implementation is located.
declare function decompress(data: Uint8Array, format: CompressionFormat): Promise<Uint8Array>;

export type Compression = CompressionFormat | null;

export interface ReadOptions {
  compression?: Compression;
}

export async function read(data: Uint8Array, { compression }: ReadOptions = {}){
  // Snippet of the function body; incomplete
  if (compression === undefined){
    try {
      return await read(data, { compression: null });
    } catch (error){
      try {
        return await read(data, { compression: "gzip" });
      } catch {
        try {
          return await read(data, { compression: "deflate" });
        } catch {
          try {
            return await read(data, { compression: "deflate-raw" });
          } catch {
            throw error;
          }
        }
      }
    }
  }

  // From this point in the function body, `compression` can be anything except `undefined`.
  compression;

  if (compression !== null){
    data = await decompress(data, compression);
  }

  // `data` is now uncompressed; no compression format had to be known ahead of time.
  data;
}
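A minimal sketch of the decompress helper declared above (my implementation, not the project's): it pushes the whole buffer through a DecompressionStream via a writer, deliberately not awaiting write()/close() because of the browser behavior discussed, and lets any failure surface through the collected result instead:

```javascript
// Hypothetical implementation of the `decompress` helper declared above.
// write()/close() are intentionally not awaited; a failure (e.g. wrong
// format) errors the readable side and rejects arrayBuffer() instead.
async function decompress(data, format) {
  const ds = new DecompressionStream(format);
  const writer = ds.writable.getWriter();
  writer.write(data).catch(() => {});  // swallow; the error surfaces below
  writer.close().catch(() => {});
  return new Uint8Array(await new Response(ds.readable).arrayBuffer());
}
```

With this shape, a wrong-format attempt rejects the returned promise, which is exactly what the nested try-catch sniffing above relies on.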
There should be a way to register algorithms that are not built-in but can be used to construct CompressionStream and DecompressionStreams in the same realm.
@domenic provided the following API sketch:
dictionary TransformStreamInit {
  required ReadableStream readable;
  required WritableStream writable;
};

callback CompressionAlgorithm = Promise<TransformStreamInit> (DOMString format, object options);

interface CompressionAlgorithms {
  maplike<DOMString, CompressionAlgorithm>;
};

partial interface CompressionStream {
  static CompressionAlgorithms algorithms;
};

CompressionStream.algorithms.set(
  "foo",
  () => ({ readable: ..., writable: ... })
);
CompressionStream.algorithms.has('foo');
CompressionStream.algorithms.keys();
Internally, the set of available algorithms would be stored in an internal slot on the global object.
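A userland approximation of this registry can be sketched today with a Map and a factory function (the names algorithms and createCompressionStream are illustrative, not proposed spec names):

```javascript
// Sketch: a per-realm registry of custom algorithms, falling back to the
// built-in CompressionStream for formats that aren't registered.
const algorithms = new Map();

function createCompressionStream(format, options = {}) {
  const factory = algorithms.get(format);
  if (factory) return factory(format, options); // must yield { readable, writable }
  return new CompressionStream(format);         // built-in fallback
}

// Register a trivial "identity" algorithm backed by a pass-through
// TransformStream (a real registration would wire up actual codec logic).
algorithms.set('identity', () => new TransformStream());
```

Piping through createCompressionStream('identity') returns the input unchanged, while 'gzip' still reaches the native implementation.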
I'm not very familiar with streams and compression, but hopefully this is understandable.
For deflate, the spec states "It is an error if there is additional input data after the ADLER32 checksum." For gzip, the spec says "It is an error if there is additional input data after the end of the "member"."
As expected, Chrome's current implementation throws a TypeError ("Junk found after end of compressed data.") when extra data is written to a DecompressionStream.
This error can be caught and ignored, but there doesn't seem to be a way of retrieving the already-written-but-not-used "junk" data. There seems to be an assumption here that developers already know the length of the compressed data, and can provide exactly that data and nothing more. On the contrary, this "junk" data can be very important in cases where the compressed data is embedded in another stream and you don't know the length of the compressed data.
A good example of this is Git's PackFile format, which only tells you the size of the uncompressed data, not the compressed size. In such a case you must rely on the decompressor to tell you when it's done decompressing data, and then handle the remaining data.
My attempt at putting together an example:
// A stream with two compressed items
// deflate("Hello World") + deflate("FooBarBaz")
const data = new Uint8Array([
0x78, 0x9c, 0xf3, 0x48, 0xcd, 0xc9, 0xc9, 0x57, 0x08, 0xcf, 0x2f, 0xca, 0x49, 0x01, 0x00, 0x18, 0x0b, 0x04, 0x1d,
0x78, 0x9c, 0x73, 0xcb, 0xcf, 0x77, 0x4a, 0x2c, 0x72, 0x4a, 0xac, 0x02, 0x00, 0x10, 0x3b, 0x03, 0x57,
]);
// Decompress the first item
const item1Stream = new DecompressionStream('deflate');
item1Stream.writable.getWriter().write(data).catch(() => { /* Rejects with a TypeError: Junk found after end of compressed data. */ });
console.log(await item1Stream.readable.getReader().read()); // "Hello World"
// How do I get the remaining data (the "junk") in order to decompress the second item?
// I've already written it to the previous stream, and there's nothing to tell me how much was used or what's left over.
const item2Stream = new DecompressionStream('deflate');
item2Stream.writable.getWriter().write(getRemainingDataSomehow());
console.log(await item2Stream.readable.getReader().read()); // "FooBarBaz"
Now, as a workaround, I could write the data to my first stream one byte at a time, saving the most recently written byte and carrying it over when the writer throws that specific exception - but writing one byte at a time feels very inefficient and adds a lot of complexity, and checking for that specific error message seems fragile (it might change, and other implementations might use a different message).
Zlib itself provides a way to know which bytes weren't used (though I don't know the details of how).
Python's zlib API provides an unused_data property that contains the unused bytes.
Node's zlib API provides a bytesWritten property that can be used to calculate the unused data.
It would be great to have something similar available in the DecompressionStream api.
Maybe we should have something like
static Promise<Uint8Array> compress(DOMString format, BufferSource input);
in CompressionStream, and a similar API for DecompressionStream, to make one-shot compression and decompression easier to do.
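Until something like that exists, one-shot helpers are easy to sketch in userland. These mirror the shape of the proposal above but are ordinary functions, not statics on CompressionStream:

```javascript
// Userland one-shot helpers approximating the proposed static methods.
async function compress(format, input) {
  const stream = new Blob([input]).stream()
    .pipeThrough(new CompressionStream(format));
  return new Uint8Array(await new Response(stream).arrayBuffer());
}

async function decompress(format, input) {
  const stream = new Blob([input]).stream()
    .pipeThrough(new DecompressionStream(format));
  return new Uint8Array(await new Response(stream).arrayBuffer());
}
```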
Even for built-in algorithms (gzip and deflate), there are various parameters that users can supply which can unlock some use cases. Examples:
Will this become part of Streams or do we create a new "top-level" compression standard somewhere?
There should be no need for as much boilerplate as there is now, nor a need to reference RFC 2119.
Hey,
I was just pointed to this spec by Kai Ninomiya. At Unity3D we notice time and time again that it would be great to have built-in, browser-provided support for compressing and decompressing. Our wish list:
Some of the use cases that we have had:
Content-Encoding: gzip (or br). This way the server's on-demand compression is not needed, avoiding taxing the server CPU and its compression caches, and users get a compressed file over the wire.
However, it is very common for Unity3D game developers, when it comes time to host a game, to not have access to configuring web servers arbitrarily. Whether due to not being web admins or not having the knowledge, they are unable to add the Content-Encoding header. In that case, Unity downloads gzipped asset files directly into the engine.
We decompress the file in Wasm, but experience shows that this is slower than what the native browser side could do, and we have lost streamed decompression due to not having an appropriate streaming decompressor in place (this could be implemented, but we never quite had the cycles to do so).
If the browser exposed its own decompressors, then we could just feed the file over to those, and not have to worry about a performance loss, even if the user was not able to set up the header fields correctly.
For compression, though, Tiny Unity does not have a solution, and hence we compile in zlib. It is not the largest in size, but it would be nice to be able to remove it - and in particular to have the browser provide a brotli codec for the page to use. Shipping one codec with the engine is tough, but having to ship multiple is even tougher, which adds friction whenever we attempt to update.
In this kind of setup, it would be very useful to be able to synchronously run a compression or a decompression activity in a Worker. This way the Worker could retain its callstack context in connection with the Wasm SAB work queue, rather than having to structure the programs to yield to the Worker's event loop in order to be able to process other functions. Yielding has been observed to worsen latency in real-time rendering applications, which target interactive frame rates.
Synchronous compression and decompression is naturally not expected to work on the main browser thread.
So I am curious what the latest status of this spec is, and I wonder how the above use cases would mesh with the existing spec?
Thanks!
There are many ways compressed input can be invalid, including at least
The specification doesn't currently specify which, if any, of these should error the stream.
Probably they all should, although we may want options to disable some kinds of errors to make it easier to handle badly-formed input.
See also ricea/compressstream-explainer#9.
Currently it accepts a string, but given that the constructors throw on unsupported formats, maybe it could be an enum?
It would be good to support all the compression, or at least decompression, schemes supported by servers and browsers. This looks like a great step forward to not having to write compression and decompression in JavaScript.
My use-case would be to download a bunch of brotli files in an archive (somewhat like tar), split them into individual files (unarchive, but don't decompress yet) in the service-worker cache, and return them to the browser as called for.
Currently the standard says
Let buffer be the result of compressing chunk with cs's format.
However, we also need to pass the current state into the compression algorithm. Probably we should create an empty state in the constructor and then use it every time we compress a buffer.
The "deflate" format supports preset dictionaries. These permit backreferences to be used from the start of the data to refer to items in the dictionary as if it was prepended to the uncompressed data. This can give significant improvements in compression ratio, particularly for small inputs. See FDICT in RFC1950. This is also a common feature in other compression formats.
This should be supported by CompressionStream and DecompressionStream.
For CompressionStream, an obvious API would be
const cs = new CompressionStream("deflate", { dictionary: aBufferSourceObject });
An open question is whether it is necessary to be able to pass multiple dictionaries to DecompressionStream (keyed by the Adler32 checksum), or whether just passing a single dictionary is sufficient. If we only support passing a single dictionary, this requires the calling code to either know by some out-of-band method what dictionary is in use, or parse the Adler32 checksum out of the header itself to choose the right dictionary.
In terms of adding additional compression algorithms, I'll take a stand and cry out for Google's Snappy (sz). It's what I've been using lately, regardless of how popular or old it is. While everyone cries for brotli, I'll leave this here.
Sorry - I've spent a lot of time tonight wrestling with the thought of hacking together a solution to support custom algorithms in a client browser. I'm wrapping up development of a production-ready back-end for all of my personal projects to come, and I was easily able to encode my large data files, as well as network requests, with it. Which led me down this rabbit hole; the least I could do is open an issue or leave a comment to further nudge the development of the web!
I'm actually building my own client, so while this sounds entirely specific, it will be thanks to Wails. So whenever WebKitGTK becomes one of the first browsers to support anything better or more unique than what's commonly found, I'll be happy!
Can't believe this was one of the only people crying out for it as well, 10 whole years ago.
Thank you for coming to my TED talk; just wanted to annoy you all out of love.
compression.spec.whatwg.org is not among the domain names associated with the certificates served on https://compression.spec.whatwg.org/
Issue for discussion of adding support for the zstd format to the API.
CompressionStream.prototype.format should be a getter which returns the format the stream was created with. Similarly for DecompressionStream.prototype.format.