qoi2-bikeshed's People

Contributors

nigeltao

qoi2-bikeshed's Issues

Applying QOI for video

Given the simplicity of the QOI format, I think it might be useful as a basis for some kind of video codec, for example to output video files from a C program without excessive complexity. Say I wanted to write a C program that generates a series of images and produces a video where each image is a frame. I could either output each frame as an image file and use a tool like FFmpeg to combine them, or use existing codecs/formats in my program, relying on external libraries or much more complicated algorithms. If a video format based on QOI existed, it might not be so difficult to output the video itself directly from my program without all the complexity of more sophisticated formats, so the work could be done quickly at the expense of larger (but hopefully smaller than raw) file sizes.

That being said, storing one roughly-PNG sized image per frame will quickly take a lot of space, so any tricks that can reduce size, including possibly lossy compression, can be useful. I am trying to think of what kind of methods could be employed to save space for sequences of frames. Using YUV color space downsampling to YUV4:2:0 cuts the raw data in half, and storing this in existing QOI format seems trivial, just store each 4 pixels' worth of YYU-YYV instead of the canonical 2 pixels' worth of RGB-RGB. Or perhaps, if alpha is not being stored in the video, 8 pixels can be stored as YYYY-YYYY-UUVV (or any other permutation), taking up the space of only 3. If something like this is used, perhaps there should be separate caches for chrominance and luminance. It might prove kind of tricky since each entry in the cache will represent data across multiple pixels instead of just one.

Then, I'm wondering if there's any good way of adding some kind of quantization to further reduce image size. For example, when hashing pixels to store in the cache, discarding N low bits of each component, then instead of checking that the existing entry in the cache matches the current pixel, check if it is within some tolerance threshold and, if so, treat it as a cache hit. Also could add a tolerance threshold for the RLE.

I also saw a suggestion in another issue of allowing the DIFF opcodes to be scaled, to allow them to cover a greater range at the expense of precision.

Then, if this is used for a video of sequential frames, perhaps there is a way to exploit this to produce inter-frames that compress to even smaller. The simplest way would just be to subtract the previous frame from the current, pixel by pixel. Maybe there's a way to use motion compensation, but I don't think I have nearly a good enough idea of how to do that on my own to make any claims. Maybe something using multi-pixel macro-blocks, but I am hoping for ideas and a discussion as to what might work. When I get the chance, I'll try to do some testing to get actual results for the ideas I've had above.
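
The simplest inter-frame idea above can be sketched as follows (a hedged sketch, not an existing codec; frame_delta/frame_undelta are illustrative names): subtract the previous frame from the current one byte by byte with wrapping, then feed the residual through the ordinary QOI encoder, so mostly-static frames become long zero runs that QOI's RLE op compresses well.

```c
#include <stddef.h>
#include <stdint.h>

// Residual = current - previous, per channel, wrapping mod 256.
static void frame_delta(const uint8_t *prev, const uint8_t *cur,
                        uint8_t *residual, size_t n_bytes) {
    for (size_t i = 0; i < n_bytes; i++)
        residual[i] = (uint8_t)(cur[i] - prev[i]);
}

// The decoder inverts this after QOI-decoding the residual.
static void frame_undelta(const uint8_t *prev, const uint8_t *residual,
                          uint8_t *cur, size_t n_bytes) {
    for (size_t i = 0; i < n_bytes; i++)
        cur[i] = (uint8_t)(prev[i] + residual[i]);
}
```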

QOI_DELTA Proof of Concept

QOI_DELTA might be a viable replacement for QOI_DIFF_8. It stores -1..1 for each of R, G and B, compared to QOI_DIFF_8, which stores -2..1. Narrowing the range allows it to be packed into 5 bits with a 3 bit tag, leaving 5 unused values that can be used as 8 bit tags. Assuming that special-casing RGB and RGBA as a last-resort encoding (instead of the wasteful QOI_COLOR op) is the way forward, we will probably need some space for 8 bit tags anyway, so this is convenient. In the example below they've been filled with QOI_COLOR_RGB, QOI_COLOR_RGBA, QOI_COLOR_A, QOI_RUN_16 and QOI_RUN_24 (QOI_COLOR_A doesn't seem very useful). The example hasn't been tuned for efficiency or compression and some of the other opcode choices are questionable; this is a WIP that is being iterated on, and only meant as an example of QOI_DELTA.

// opcodes
#define QOI_INDEX      0x80 // 1xxxxxxx i7
#define QOI_DIFF_16    0x40 // 01rrrrrg ggggbbbb : r5g5b4
#define QOI_DELTA      0x20 // 001xxxxx (where x is in the range 0..26)
#define QOI_DIFF_24    0x10 // 0001xxxx
#define QOI_RUN_8      0x00 // 0000xxxx
// special 8 bit opcodes stored in the upper range of delta
#define QOI_COLOR_RGB  0x3b // Following 3 bytes RGB
#define QOI_COLOR_RGBA 0x3c // Following 4 bytes RGBA
#define QOI_COLOR_A    0x3d // Following byte A
#define QOI_RUN_16     0x3e // 8 bit run follows
#define QOI_RUN_24     0x3f // 16 bit run follows

#define QOI_RLE_1      0x2d // run length 1, actually a delta op
#define QOI_RUN_0_MAXLEN 1 //max run encodable without an explicit run op
#define QOI_RUN_8_MAXLEN (QOI_RUN_0_MAXLEN+16)
#define QOI_RUN_16_MAXLEN (QOI_RUN_8_MAXLEN+256)
#define QOI_RUN_24_MAXLEN (QOI_RUN_16_MAXLEN+65536)
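
One way the three -1..1 deltas can fit into the 5-bit field is base-3 packing, which maps them to 0..26 and leaves 27..31 (bytes 0x3b..0x3f) free for the 8 bit tags above. This is a sketch of that packing, assuming that layout (the helper names are illustrative):

```c
#define QOI_DELTA 0x20 // 001xxxxx, x in 0..26

// dr, dg, db each in -1..1; returns the full opcode byte 0x20..0x3a.
static unsigned char qoi_delta_pack(int dr, int dg, int db) {
    return (unsigned char)(QOI_DELTA | ((dr + 1) * 9 + (dg + 1) * 3 + (db + 1)));
}

static void qoi_delta_unpack(unsigned char op, int *dr, int *dg, int *db) {
    int v = op & 0x1f;        // the 5-bit payload, 0..26
    *dr = v / 9 - 1;
    *dg = (v / 3) % 3 - 1;
    *db = v % 3 - 1;
}
```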

op count: 265215890
QOI_DIFF_16=30.78%, QOI_INDEX=30.22%, QOI_DELTA=20.50%, QOI_COLOR_RGB=6.53%, QOI_RUN_8=6.46%, QOI_DIFF_24=4.74%, QOI_RUN_16=0.60%, QOI_COLOR_RGBA=0.12%, QOI_COLOR_24=0.03%, QOI_COLOR_A=0.01%

## Benchmarking ../qoitests/images/tango512/*.png -- 1 runs
## Totals (AVG) size: 0x0
        decode ms   encode ms   decode mpps   encode mpps   size kb
libpng:       3.1        20.9         83.95         12.57        51
stbi:         2.1        21.0        122.62         12.46        69
qoi:          0.7         0.9        367.64        284.78        77

## Benchmarking ../qoitests/images/kodak/*.png -- 1 runs
## Totals (AVG) size: 0x0
        decode ms   encode ms   decode mpps   encode mpps   size kb
libpng:       9.9       184.7         39.55          2.13       717
stbi:        10.2       103.0         38.48          3.82       979
qoi:          3.4         5.5        116.04         71.56       777

## Benchmarking ../qoitests/images/misc/*.png -- 1 runs
## Totals (AVG) size: 0x0
        decode ms   encode ms   decode mpps   encode mpps   size kb
libpng:      12.4       102.7         71.71          8.66       283
stbi:        10.5        96.9         84.60          9.18       415
qoi:          2.7         3.9        330.06        230.68       402

## Benchmarking ../qoitests/images/textures/*.png -- 1 runs
## Totals (AVG) size: 0x0
        decode ms   encode ms   decode mpps   encode mpps   size kb
libpng:       3.0        42.7         42.92          3.04       163
stbi:         2.7        23.5         47.49          5.53       232
qoi:          0.7         1.2        175.07        111.13       176

## Benchmarking ../qoitests/images/screenshots/*.png -- 1 runs
## Totals (AVG) size: 0x0
        decode ms   encode ms   decode mpps   encode mpps   size kb
libpng:      58.6       632.4        140.56         13.02      2219
stbi:        52.4       774.3        157.16         10.63      2821
qoi:         25.3        30.4        324.87        270.37      2509

## Benchmarking ../qoitests/images/wallpaper/*.png -- 1 runs
## Totals (AVG) size: 0x0
        decode ms   encode ms   decode mpps   encode mpps   size kb
libpng:     194.9      2938.7         48.09          3.19      9224
stbi:       221.8      1794.5         42.26          5.22     13299
qoi:         75.8        97.6        123.67         95.99     10919

Add palette in header

Add a palette in the header. It would enlarge the minimal file by 244 bytes (shrinking the original header to 12 bytes makes the full header a nice 256 bytes).
The best compression results can be obtained with a palette built not from the first 61 colors whose hash was not yet in the palette, but from the most frequent uncompressed colors (those that could not be encoded as diffs or run lengths). This can be done with a two-pass algorithm: in the first pass the encoder does not use QOI_OP_INDEX, but counts every QOI_OP_RGBA and QOI_OP_RGB it produces. After the first pass we have a histogram of uncompressed colors, and the 61 most common form the table. In the second pass we write the file from the results of the first pass, replacing the unpacked colors with their corresponding QOI_OP_INDEX values.
This will probably slow down the encoder by a few times, but it will still be a lot faster than PNG, and it may increase the compression ratio and speed up the decoder. Perhaps this is a reasonable compromise, since decoding and storage are the more frequent situations in an image's lifecycle.
The modification can also work with the old one-pass algorithm and still speed up decoding.
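
The pass-1 bookkeeping can be sketched like this (an illustration of the idea, not a patch to qoi.h; it assumes fewer distinct verbatim colors than HIST_CAP): count every color the encoder would have emitted as QOI_OP_RGB/QOI_OP_RGBA, then take the 61 most frequent as the palette.

```c
#include <stdint.h>
#include <stdlib.h>

#define HIST_CAP 4096            // open-addressed table, power of two
typedef struct { uint32_t rgba; uint32_t count; } hist_entry;

static hist_entry hist[HIST_CAP];

// Pass 1: called for every color emitted verbatim.
static void hist_add(uint32_t rgba) {
    uint32_t i = (rgba * 2654435761u) & (HIST_CAP - 1);
    while (hist[i].count && hist[i].rgba != rgba)
        i = (i + 1) & (HIST_CAP - 1);   // linear probing
    hist[i].rgba = rgba;
    hist[i].count++;
}

static int cmp_count_desc(const void *a, const void *b) {
    const hist_entry *x = a, *y = b;
    return (x->count < y->count) - (x->count > y->count);
}

// Fills palette[] with up to 61 most frequent colors; returns how many.
static int hist_top61(uint32_t palette[61]) {
    qsort(hist, HIST_CAP, sizeof(hist[0]), cmp_count_desc);
    int n = 0;
    for (int i = 0; i < HIST_CAP && n < 61; i++)
        if (hist[i].count) palette[n++] = hist[i].rgba;
    return n;
}
```

Pass 2 then rewrites the stream, replacing those colors with QOI_OP_INDEX references into this table.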

P.S. An additional modification can be made to the two-pass coding algorithm (I'm not sure it will improve compression much; it needs testing). If QOI_OP_INDEX does not put a pixel into the output directly but only changes the state, then some other chunk is needed to insert a new pixel into the output stream. QOI_OP_INDEX followed by QOI_OP_DIFF or QOI_OP_LUMA can then virtually enlarge the color table, and this can be used when creating the table. On some images this may improve compression, but it will probably work worse for images with few colors. It should also slow things down, though not too much for large images compared to the basic two-pass algorithm.

Diff from average instead of previous pixel

For the luma operations, I tried comparing against the average of the previous pixel and the pixel above the current one, instead of the previous pixel alone. It resulted in significantly better compression.
Together with the other tags I chose, I was able to get the compression numbers below.
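
A minimal sketch of the averaged predictor, assuming this reading of the description (the attached qoi2avg.h may differ in details): the luma diff is taken against the per-channel average of the left pixel and the pixel above.

```c
#include <stdint.h>

typedef struct { uint8_t r, g, b, a; } rgba_t;

static rgba_t predict_avg(rgba_t left, rgba_t above) {
    rgba_t p;
    p.r = (uint8_t)((left.r + above.r) / 2);
    p.g = (uint8_t)((left.g + above.g) / 2);
    p.b = (uint8_t)((left.b + above.b) / 2);
    p.a = left.a;                 // alpha still predicted from the left
    return p;
}

// Per-channel wrapping diff against the prediction, as in QOI's luma ops.
static int8_t wrap_diff(uint8_t cur, uint8_t pred) {
    return (int8_t)(uint8_t)(cur - pred);
}
```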

# Grand total for images
        decode ms   encode ms   decode mpps   encode mpps   size kb    rate
qoi:          2.6         3.0        177.46        155.06       463   28.2%
qoi2avg:      2.9         3.7        162.71        125.88       405   24.7%

# Grand total for images-lance
        decode ms   encode ms   decode mpps   encode mpps   size kb    rate
qoi:         17.2        15.1        236.14        268.34      2109   13.6%
qoi2avg:     17.2        18.7        236.05        217.03      1709   11.0%

qoi2avg.h.txt

Replace hash function with history list

The hash function is bad. I experimented a lot, and no matter what hash multipliers or table size I use, there are always a lot of images in my private test bench of ~10k tiles that utterly fail (in the sense that most, if not all, of the colors in the image have the same hash).

Long story short, if instead of a hash table, I keep a circular buffer of the 64 last encountered colors*, the compression ratio...:

  • Decreases in 1% of the images (averaging to 100.8% of QOI's ratio)
  • Stays the same in 35% of the images
  • Increases in 63% of the images (averaging to 79.8% of QOI's ratio, an improvement of 20.2%!)
  • Averages to 87% of QOI overall.

image

Worst ratio is 103% of QOI (E78DD866)
Best ratio is 31.5% of QOI (EC442D91)

All of the worst ratios seem to stem from random OP noise, where some OP_INDEX were replaced by OP_RGB (because of a hashtable hit but history table miss), and conversely, some OP_RGB replaced by OP_INDEX.

The decoder becomes slightly slower because of the color search performed at each pixel.

I tried to explain the code changes line-by-line, but it turns out to be a bit more involved than that. For one, the decoder willy-nilly updates the hash table after every OP, even when it's obviously not necessary. This needs to be changed so that the circular history table only gets updated when the OP is not RUN or INDEX, with an exception in OP_RUN for px_pos == 0, in which case it needs to be updated anyway. As for making the list 'circular': I keep a local historyPointer variable that I increase after every store, modulo the table length.

*There's no need to explicitly check if the color to be appended to the list is the same as the previous color appended to the list, because the run length checking implicitly does this already. I expected it to be bad for dithered images, because, well, two colors keep appearing alternately, but the circular history table elegantly prevents this issue as no new colors should be recorded when OP_INDEX was issued.
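
The replacement structure can be sketched like this (names are illustrative, not from the qoi.h source): appending happens only after ops other than RUN or INDEX, and lookup is a linear search over the 64 most recent colors.

```c
#include <stdint.h>

#define HISTORY_LEN 64
static uint32_t history[HISTORY_LEN];
static int history_pos = 0;      // next slot to overwrite

// Append a color, overwriting the oldest entry.
static void history_push(uint32_t rgba) {
    history[history_pos] = rgba;
    history_pos = (history_pos + 1) % HISTORY_LEN;
}

// Returns the index of rgba in the history, or -1 on a miss.
static int history_find(uint32_t rgba) {
    for (int i = 0; i < HISTORY_LEN; i++)
        if (history[i] == rgba) return i;
    return -1;
}
```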

Secondary index cache with two-byte encoding

Another possible opcode, QOI_OP_INDEX2. This is a two-byte encoding with an eight-bit opcode that acts as an index into a secondary index cache with 256 values.

Pro:

  • Bigger cache matches more input, allows some pixels to be stored as two bytes instead of 4 or 5 bytes
  • Only takes up a single opcode
  • Reuses the color hash calculation, getting more mileage from it

Con:

  • Requires an extra 1024 bytes of memory for the secondary cache. This seems doable even for microcontrollers/etc but expert eyes may have a different opinion
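
A sketch of how the secondary cache could work, based on the description above (an illustration, not the attached qoi-index2.h; the 0xff tag byte is a placeholder assumption): the existing color hash keys a second, 256-entry table, emitted as a two-byte op on a hit.

```c
#include <stdint.h>

typedef union { struct { uint8_t r, g, b, a; } rgba; uint32_t v; } qoi_rgba_t;

// The color hash from qoi.h, reused for the bigger table.
#define QOI_COLOR_HASH(C) (C.rgba.r*3 + C.rgba.g*5 + C.rgba.b*7 + C.rgba.a*11)

static qoi_rgba_t index2[256];   // secondary cache: 256 * 4 = 1024 bytes

// Returns 1 and writes the two-byte op on a secondary-cache hit;
// otherwise stores px in the cache and returns 0.
static int try_index2(qoi_rgba_t px, uint8_t out[2]) {
    int slot = QOI_COLOR_HASH(px) % 256;
    if (index2[slot].v == px.v) {
        out[0] = 0xff;           // hypothetical 8-bit QOI_OP_INDEX2 tag
        out[1] = (uint8_t)slot;
        return 1;
    }
    index2[slot] = px;
    return 0;
}
```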

Comparison to current master build:

  • 4700u, gcc 11.2.0, -O3
  • index2 takes master91cc, adds the QOI_OP_INDEX2 opcode by making QOI_OP_RUN store 62 values, and rearranges the decode chain slightly
# Grand total for qoi_benchmark_suite/images
           decode ms   encode ms   decode mpps   encode mpps   size kb    rate
libpng:        6.582      90.359         70.52          5.14       423   23.3%
stbi:          6.492      61.252         71.50          7.58       601   33.2%
qoi-master91cc:2.124       2.584        218.50        179.60       463   25.6%
qoi-index2:    2.410       2.510        192.60        184.90       456   25.2%

@phoboslab If this or any other opcode change is deemed suitable I'll do proper pull requests as and when

qoi-index2.h.txt

Lossy QOI Variant

We could use a bit in the header to indicate a trivial lossy QOI variant (call it QOILY and the original flavor QOILL), reducing from 8 to 4 bits per RGBA channel on encode, reconstitute on decode.

diff --git a/qoi.h b/qoi.h
index 0aec728..6d95a50 100644
--- a/qoi.h
+++ b/qoi.h
@@ -348,6 +348,8 @@ unsigned int qoi_read_32(const unsigned char *bytes, int *p) {
 	return (a << 24) | (b << 16) | (c << 8) | d;
 }
 
+const int lossy = 4; // TODO: make lossy part of qoi_desc.channels.
+
 void *qoi_encode(const void *data, const qoi_desc *desc, int *out_len) {
 	if (
 		data == NULL || out_len == NULL || desc == NULL ||
@@ -380,7 +382,7 @@ void *qoi_encode(const void *data, const qoi_desc *desc, int *out_len) {
 	qoi_rgba_t index[64] = {0};
 
 	int run = 0;
-	qoi_rgba_t px_prev = {.rgba = {.r = 0, .g = 0, .b = 0, .a = 255}};
+	qoi_rgba_t px_prev = {.rgba = {.r = 0, .g = 0, .b = 0, .a = 255>>lossy}};
 	qoi_rgba_t px = px_prev;
 	
 	int px_len = desc->width * desc->height * desc->channels;
@@ -389,12 +391,15 @@ void *qoi_encode(const void *data, const qoi_desc *desc, int *out_len) {
 
 	for (int px_pos = 0; px_pos < px_len; px_pos += channels) {
 		if (channels == 4) {
-			px = *(qoi_rgba_t *)(pixels + px_pos);
+			px.rgba.r = pixels[px_pos+0] >> lossy;
+			px.rgba.g = pixels[px_pos+1] >> lossy;
+			px.rgba.b = pixels[px_pos+2] >> lossy;
+			px.rgba.a = pixels[px_pos+3] >> lossy;
 		}
 		else {
-			px.rgba.r = pixels[px_pos];
-			px.rgba.g = pixels[px_pos+1];
-			px.rgba.b = pixels[px_pos+2];
+			px.rgba.r = pixels[px_pos+0] >> lossy;
+			px.rgba.g = pixels[px_pos+1] >> lossy;
+			px.rgba.b = pixels[px_pos+2] >> lossy;
 		}
 
 		if (px.v == px_prev.v) {
@@ -577,6 +582,13 @@ void *qoi_decode(const void *data, int size, qoi_desc *desc, int channels) {
 		}
 	}
 
+	if (lossy) {
+		for (int px_pos = 0; px_pos < px_len; px_pos++) {
+			unsigned char c = pixels[px_pos];
+			pixels[px_pos] = c | (c<<4);
+		}
+	}
+
 	return pixels;
 }
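
Restated on its own, the 4-bit round trip in the patch is: encode shifts each channel right by 4, decode reconstitutes by replicating the nibble, so the maximum per-channel error is 15 (e.g. 0x0F decodes to 0x00).

```c
#include <stdint.h>

static uint8_t quantize4(uint8_t c)     { return (uint8_t)(c >> 4); }
static uint8_t reconstitute4(uint8_t q) { return (uint8_t)(q | (q << 4)); }
```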

File sizes are now ballpark comparable to (lossy) JPEG, with much of the benefits of lossless QOI (very simple implementation, very fast encode and decode).

The file size ratios vs JPEG look better for images/screenshots (roughly 1x) than images/kodak (roughly 2x, but these are photos, possibly even JPEGs to start with).

QOI-Lossy vs JPEG File Sizes

 191138 kodak/kodim01.jpeg
 348490 kodak/kodim01.qoily

 140532 kodak/kodim02.jpeg
 242638 kodak/kodim02.qoily

 105215 kodak/kodim03.jpeg
 213045 kodak/kodim03.qoily

 137832 kodak/kodim04.jpeg
 285990 kodak/kodim04.qoily

 208435 kodak/kodim05.jpeg
 365994 kodak/kodim05.qoily

---

1735098 screenshots/amazon.com.jpeg
2031597 screenshots/amazon.com.qoily

 829789 screenshots/apple.com.jpeg
 805029 screenshots/apple.com.qoily

1215546 screenshots/cnn.com.jpeg
1025061 screenshots/cnn.com.qoily

 348846 screenshots/duckduckgo.com.jpeg
 180998 screenshots/duckduckgo.com.qoily

1254782 screenshots/en.wikipedia.org.jpeg
 851898 screenshots/en.wikipedia.org.qoily

The JPEGs were generated by ImageMagick's convert foo.png foo.jpeg.

QOI-Lossless (the Original QOI) vs PNG File Sizes

 736501 kodak/kodim01.png
 931318 kodak/kodim01.qoill

 617995 kodak/kodim02.png
 723898 kodak/kodim02.qoill

 502888 kodak/kodim03.png
 608468 kodak/kodim03.qoill

 637432 kodak/kodim04.png
 782172 kodak/kodim04.qoill

 785610 kodak/kodim05.png
 975022 kodak/kodim05.qoill

---

6381202 screenshots/amazon.com.png
5511312 screenshots/amazon.com.qoill

2360762 screenshots/apple.com.png
2208829 screenshots/apple.com.qoill

2748636 screenshots/cnn.com.png
2508600 screenshots/cnn.com.qoill

 289177 screenshots/duckduckgo.com.png
 338638 screenshots/duckduckgo.com.qoill

1316655 screenshots/en.wikipedia.org.png
1584675 screenshots/en.wikipedia.org.qoill

QOI-Lossless vs QOI-Lossy File Sizes

 931318 kodak/kodim01.qoill
 348490 kodak/kodim01.qoily

 723898 kodak/kodim02.qoill
 242638 kodak/kodim02.qoily

 608468 kodak/kodim03.qoill
 213045 kodak/kodim03.qoily

 782172 kodak/kodim04.qoill
 285990 kodak/kodim04.qoily

 975022 kodak/kodim05.qoill
 365994 kodak/kodim05.qoily

---

5511312 screenshots/amazon.com.qoill
2031597 screenshots/amazon.com.qoily

2208829 screenshots/apple.com.qoill
 805029 screenshots/apple.com.qoily

2508600 screenshots/cnn.com.qoill
1025061 screenshots/cnn.com.qoily

 338638 screenshots/duckduckgo.com.qoill
 180998 screenshots/duckduckgo.com.qoily

1584675 screenshots/en.wikipedia.org.qoill
 851898 screenshots/en.wikipedia.org.qoily

QOIP format

I've made some progress on a QOI-like format that stores the opcodes used in the header (it's early days, if it breaks let me know). This allows the bitstream and opcodes to be tailored to the input, which has a range of mostly positive consequences. The main negative consequence is that the generic encode/decode path is necessarily slower than a fixed-opcode format (mainly because of function pointers and non-optimal opcode ordering); this can be mitigated for non-exhaustive searches if common combinations get "fast-path" implementations (TODO).

https://github.com/chocolate42/qoipond

Currently you won't see the potential compression benefit of the format using qoipbench, because crunch-like functionality hasn't been implemented there (TODO within a few days). qoipcrunch is where you can see the benefit: first create a qoip file with qoipconv, then call qoipcrunch -in file.qoip -out file2.qoip -level 3 to do a full search of opcode combinations. The level parameter accepts 5 values:

  • -1: Just try the default combination which is currently an approximation of propA. This is what qoipbench currently does
  • 0: Try a small selection of efficient combinations. This makes level 0 a good middleground between performance and compression, and these combinations are a good target for fast-path implementations to make it even better. Currently the combinations are approximations of propA, propA+OP_A, qoi, delta9h
  • 1: Try the above plus some other combinations that use most of the opcode space
  • 2: Try the above plus more combinations that use most of the opcode space
  • 3: Try all implemented combinations that use most of the opcode space. Right now that means a few hundred combinations, but as more ops get implemented the number of combinations at this level may grow very large. Analogues of demo* ops are next on the implementation list

Example

./qoipcrunch -in test.qoip -level 3 -out test2.qoip
Input has length   233488
New best length    188776 with opstring 000102060e121314
New best length    188544 with opstring 0001070e1213140203
New best length    188184 with opstring 0001070e1513160203
New best length    175544 with opstring 0001070e1513170203
Tried 144 combinations, reduced to 75.183307% of input

Currently implemented ops, the id can be used to decode the above opstrings.

./qoipcrunch -list
id=00, OP_RGB:  4 byte, store RGB  verbatim
id=01, OP_RGBA: 5 byte, store RGBA verbatim
id=02, OP_A:    2 byte, store    A verbatim
id=03, OP_RUN2:   2 byte, 256 value RLE
id=04, OP_RUN1_7: 1 byte, 128 value RLE
id=05, OP_RUN1_6: 1 byte,  64 value RLE
id=06, OP_RUN1_5: 1 byte,  32 value RLE
id=07, OP_RUN1_4: 1 byte,  16 value RLE
id=08, OP_RUN1_3: 1 byte,   8 value RLE
id=09, OP_RUN1_2: 1 byte,   4 value RLE
id=0a, OP_RUN1_1: 1 byte,   2 value RLE
id=0b, OP_RUN1_0: 1 byte,   1 value RLE
id=0c, OP_INDEX7: 1 byte, 128 value index cache
id=0d, OP_INDEX6: 1 byte,  64 value index cache
id=0e, OP_INDEX5: 1 byte,  32 value index cache
id=0f, OP_INDEX4: 1 byte,  16 value index cache
id=10, OP_INDEX3: 1 byte,   8 value index cache
id=11, OP_INDEX2: 1 byte,   4 value index cache
id=12, OP_DIFF:       1 byte delta,   vr  -2..1,  vg  -2..1,    vb  -2..1
id=13, OP_LUMA2_464:  2 byte delta, vg_r  -8..7,  vg -32..31, vg_b  -8..7
id=14, OP_RGB3:       3 byte delta,   vr -64..63,  g,           vb -64..63
id=15, OP_LUMA1_232:  1 byte delta, vg_r  -2..1,  vg  -4..3,  vg_b  -2..1
id=16, OP_LUMA3_676:  3 byte delta, vg_r -32..31, vg -64..63, vg_b -32..31
id=17, OP_LUMA3_4645: 3 byte delta, vg_r  -8..7,  vg -32..31, vg_b  -8..7  va -16..15
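
The opstrings printed above are just two hex digits per op id from this table, so decoding one needs only a small parser (an illustrative sketch, not code from the qoipond source):

```c
#include <stdio.h>

// Parses two hex digits per op id into ids[]; returns the count parsed.
static int parse_opstring(const char *s, unsigned ids[], int max) {
    int n = 0;
    unsigned id;
    for (; s[0] && s[1] && n < max; s += 2) {
        if (sscanf(s, "%2x", &id) != 1) break;
        ids[n++] = id;
    }
    return n;
}
```

For example, the winning opstring 0001070e1513170203 decodes to OP_RGB, OP_RGBA, OP_RUN1_4, OP_INDEX5, OP_LUMA1_232, OP_LUMA2_464, OP_LUMA3_4645, OP_A, OP_RUN2.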

Experiments with luma232, luma686 and gray tags

I attached my own idea for a balanced list of tags.

Some things I noticed while experimenting ...

  • A separate gray tag helps with grayscale images like B/W photos.
  • For a luma232, it improves things a bit to make the range of the red and blue diffs (-2..1 or -1..2) dependent on the sign bit of the green diff.
  • Luma686 seems to be able to cover a lot of pixels.
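
The sign-dependent range trick from the second bullet can be sketched as follows (an illustration of the idea; the attached qoi2.h may implement it differently): the 2-bit red/blue diff field covers -2..1 when the green diff is negative and -1..2 when it is non-negative.

```c
// Lower bound of the red/blue diff range, depending on the green diff's sign.
static int rb_min(int vg) { return vg < 0 ? -2 : -1; }

// A 2-bit field stores vg_r - rb_min(vg), which is 0..3 either way.
static unsigned pack_rb(int vg, int vg_r)    { return (unsigned)(vg_r - rb_min(vg)); }
static int      unpack_rb(int vg, unsigned b) { return (int)b + rb_min(vg); }
```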

I tried not to overdo it with the amount of tags and tried to get a decent amount of compression.

Grand total for images: 420 kb (attached code) vs 463 kb (qoi)
Grand total for images-lance: 1760 kb (attached code) vs 2109 kb (qoi)

Timings on my computer seem to be very inconsistent for some reason.
Maybe someone else can check if interested.
qoi2.h.txt

Opcode statistics

This issue isn't a proposal. It's just a place to hang some summary statistics that might inform other proposals.

The test images are linked to from https://phoboslab.org/files/qoibench/ (with misc/lenna.png removed).

The five directories form two broad categories. Photographic (kodak, textures, wallpaper) and not photographic (misc, screenshots). For QOI, the first category is probably less important than the second, as they are the sort of images for which people will use lossy compression (e.g. JPEG) rather than lossless compression (e.g. PNG, QOI).

There are a lot of unreachable combinations of bits and ways to copy the previous color to the current pixel

I believe QOI_RLE_8 is the best way to copy the previous pixel, but there are many other ways to do that.

QOI_INDEX always has the previous color, although we don't need it. Maybe we can put the colors in the table with a one chunk delay to fix this?

QOI_DIFF_8, QOI_DIFF_16 and QOI_DIFF_24 all have all-zero diff bit combinations. Maybe if 2 channels are equal to 0, we can expand the range of the other channel? For example, if channels g and b are 0, then we can change the range of channel r from [-8, 7] to [-8, -1] U [1, 8]. One of the problems is that we can't choose which channel we want to expand, so we need to pick a fixed one. I believe channels r and g are more worth expanding this way than channel b.

QOI_COLOR has a similar issue too: we can choose to write zero channels.

7-bit index (instead of 6)

Here's a possible re-packing of the bits, increasing the size of the color cache from 64 to 128 entries.

#define QOI_INDEX   0x00 // 0xxxxxxx  +0 bytes  i7
#define QOI_DIFF_8  0x80 // 10xxxxxx  +0 bytes  r2g2b2
#define QOI_DIFF_16 0xc0 // 1100xxxx  +1 bytes  r4g4b4
#define QOI_DIFF_24 0xd0 // 1101xxxx  +2 bytes  r5g5b5a5
#define QOI_DIFF_32 0xe0 // 11100000  +3 bytes  r8g8b8
#define QOI_DIFF_40 0xe1 // 11100001  +4 bytes  r8g8b8a8
#define QOI_RUN_8   0xe0 // 111xxxxx  +0 bytes  from
                //  0xe2    11100010            up to
                //  0xfd    11111101            run length   2 ..=    29
#define QOI_RUN_16  0xfe // 11111110  +1 bytes  run length  30 ..=   285
#define QOI_RUN_24  0xff // 11111111  +2 bytes  run length 286 ..= 65821
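
The run-length tiers in this packing can be summarized by how many bytes a given run takes (a sketch of the boundaries above, assuming runs of at least 2):

```c
// Bytes needed to encode one run op of the given length, or -1 if the
// run must be split into multiple ops.
static int run_op_bytes(long run) {
    if (run <= 29)    return 1;  // 0xe2 + (run - 2)
    if (run <= 285)   return 2;  // 0xfe, then run - 30 in one byte
    if (run <= 65821) return 3;  // 0xff, then run - 286 in two bytes
    return -1;
}
```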

On my laptop, a win for images/misc, a clear win for images/screenshots and same-ish for the photos.

        decode ms   encode ms   decode mpps   encode mpps   size kb
images/kodak
qoi-before:   6.4         8.3         61.02         47.11       771
qoi-after:    6.0         9.3         65.64         42.21       772
images/misc
qoi-before:   5.8         6.1        154.48        146.05       400
qoi-after:    5.5         5.8        162.43        154.29       400
images/screenshots
qoi-before:  53.9        48.6        152.63        169.25      2582
qoi-after:   49.2        40.9        167.28        201.09      2481
images/textures
qoi-before:   1.6         2.0         80.17         65.13       184
qoi-after:    1.4         2.0         90.81         65.21       180
images/wallpaper
qoi-before: 137.6       153.6         68.10         61.02     10640
qoi-after:  126.9       155.8         73.83         60.16     10669

Entropy coding test


Inspired by this issue, and also this QOI issue, I did some tests of 'entropy coding' for qoi (compressing with regular compressors).
I used a subset of the benchmark images and default compression levels (except zstd, which was set to 12 to roughly match gzip level 6).
The scripts used to make this test are also attached, in case someone wants to test it themselves.

Here are the results (gz-H is pure Huffman from deflate (`pigz -H`); rans_jkb is jkbonfield's rans_static):

20737618 bmp
 7035249 qoi
 6630663 qoi.lz4
 5881278 qoi.rans_jkb
 5850003 qoi.gz-H
 5721098 png
 5476417 qoi.zst
 5471989 qoi.gz
 5113443 qoi.lzma
 5102454 qoi.bz2

sorted by ratio

method ctime cspeed dtime dspeed compr. ratio
qoi.bz2 2.762 7.5082 1.491 13.9085 5102454 0.246048
qoi.lbz2 2.124 9.7634 1.122 18.4827 5102454 0.246048
qoi.lzma 5.934 3.4947 1.534 13.5187 5113443 0.246578
qoi.gz 1.410 14.7075 0.543 38.1908 5471989 0.263868
qoi.zst 2.061 10.0619 0.491 42.2355 5476417 0.264081
png 10.308 2.0118 2.361 8.7834 5721098 0.275880
qoi.gz-H 0.859 24.1416 0.549 37.7734 5850003 0.282096
qoi.rans_jkb 0.575 36.0654 0.448 46.2893 5881278 0.283604
qoi.lz4 0.533 38.9074 0.409 50.7032 6630663 0.319741
qoi 0.437 47.4545 0.332 62.4627 7035249 0.339251

sorted by ctime

method ctime cspeed dtime dspeed compr. ratio
qoi 0.437 47.4545 0.332 62.4627 7035249 0.339251
qoi.lz4 0.533 38.9074 0.409 50.7032 6630663 0.319741
qoi.rans_jkb 0.575 36.0654 0.448 46.2893 5881278 0.283604
qoi.gz-H 0.859 24.1416 0.549 37.7734 5850003 0.282096
qoi.gz 1.410 14.7075 0.543 38.1908 5471989 0.263868
qoi.zst 2.061 10.0619 0.491 42.2355 5476417 0.264081
qoi.lbz2 2.124 9.7635 1.122 18.4827 5102454 0.246048
qoi.bz2 2.762 7.5082 1.491 13.9085 5102454 0.246048
qoi.lzma 5.934 3.4947 1.534 13.5187 5113443 0.246578
png 10.308 2.0118 2.361 8.7834 5721098 0.275880

sorted by dtime

method ctime cspeed dtime dspeed compr. ratio
qoi 0.437 47.4545 0.332 62.4627 7035249 0.339251
qoi.lz4 0.533 38.9074 0.409 50.7032 6630663 0.319741
qoi.rans_jkb 0.575 36.0654 0.448 46.2893 5881278 0.283604
qoi.zst 2.061 10.0619 0.491 42.2355 5476417 0.264081
qoi.gz 1.410 14.7075 0.543 38.1908 5471989 0.263868
qoi.gz-H 0.859 24.1416 0.549 37.7734 5850003 0.282096
qoi.lbz2 2.124 9.7635 1.122 18.4827 5102454 0.246048
qoi.bz2 2.762 7.5082 1.491 13.9085 5102454 0.246048
qoi.lzma 5.934 3.4947 1.534 13.5187 5113443 0.246578
png 10.308 2.0118 2.361 8.7834 5721098 0.275880

As can be seen, lz4 is inefficient in this application, which becomes clear when one realises that lz4 does not use any entropy coding, and entropy coding, even a pure prefix code like deflate's Huffman, is what makes the difference in this contest. Lzma is very tight but still loses to bzip2, both in compression and in speed. Zstd turns out to be worse and slower than gzip; only decompression is slightly faster, which doesn't make up for it.
Static rANS is on par with Huffman compression, though faster, especially at compression.

All in all, the best compressor by ratio is bzip2, and the fastest is gzip/deflate; both beat png, in compression (4%-5%) and in speed (~7 times).


qoiformat-bench-1b.zip

Combine Index + Diff and Index + Luma

I spent a few hours hacking on the original qoi image format, adding combined index+diff and index+luma opcodes.
From what I've seen (on my images), the run length encoding very rarely exceeds 10.
From my understanding of the format, the diff and luma opcodes can only apply to the previously seen pixel, which I believe is good for gradients. If these two opcodes cannot be applied, the only option is to push a new pixel color with the rgb or rgba opcode.
I had the (biased) feeling that combining index+diff or index+luma (applying the diff/luma opcode to a color from the index table) could result in better compression.

That's what I've done, and I've confirmed that it indeed "works" (to be taken with a grain of salt).
It also doesn't increase decoding complexity too much.

My current implementation is naive: it goes through all 64 pixels in the "hash" table and tries to find a pixel that's close enough.
Another possibility would be to use another table of pixels with their values truncated/aligned (not sure how), with the lower bits encoded in the "diff" opcode.
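
The naive search can be sketched like this (an illustration of the idea, not code from the linked branch; the tag byte and the -2..1 tolerance are assumptions): scan the 64-entry table for a color within diff range of the current pixel, then emit the index plus three biased 2-bit diffs.

```c
#include <stdint.h>

typedef union { struct { uint8_t r, g, b, a; } rgba; uint32_t v; } qoi_rgba_t;

// On success, writes a two-byte index+diff op (6-bit index in the first
// byte, three 2-bit biased diffs in the second) and returns 1.
static int try_index_diff(const qoi_rgba_t table[64], qoi_rgba_t px,
                          uint8_t out[2]) {
    for (int i = 0; i < 64; i++) {
        int dr = (int)px.rgba.r - table[i].rgba.r;
        int dg = (int)px.rgba.g - table[i].rgba.g;
        int db = (int)px.rgba.b - table[i].rgba.b;
        if (dr >= -2 && dr <= 1 && dg >= -2 && dg <= 1 &&
            db >= -2 && db <= 1 && px.rgba.a == table[i].rgba.a) {
            out[0] = (uint8_t)(0x80 | i);  // hypothetical tag + index
            out[1] = (uint8_t)(((dr + 2) << 4) | ((dg + 2) << 2) | (db + 2));
            return 1;
        }
    }
    return 0;
}
```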


edit: the branch can be found here: https://github.com/jmaselbas/qoi/tree/qoi_idiff_iluma

RIFF container

Here is a concrete suggestion for an extensible header / container format. There are reserved-zero bits that are effectively a version number (#2). It can hold metadata such as color profiles (#7). It allows parallelized decoding (#9). Etc.

It builds on RIFF and vanilla (headerless) QOI, heavily inspired by how WebP Extended builds on RIFF and VP8/VP8L. We could call this format "qoiriffic", file extension "qoir".

RIFF

A quick overview: RIFF is a 12 byte header then a sequence of chunks.

The header is:

  • 4 bytes "RIFF",
  • a u32le File Size (yeah, for streaming encoders, use vanilla QOI with a different header / container) and
  • 4 more bytes (in this case, "QOIR"). QOIR is this container format. QOI (or perhaps QOIC or QOIP) is the pixel-compression bytecode thing (#8). The "upstream" stand-alone QOI format with a 14-byte header could be called QOIF.

Each chunk is:

  • a u32le Chunk FourCC (e.g. "EXIF"),
  • a u32le Chunk Size and
  • a variable sized Chunk Payload (padded to an even number of bytes).

QOIX

The first chunk is a "QOIX" chunk, which is pretty much exactly like WebP's "VP8X" chunk, except that the Alpha (L) and Animation (A) bits must be zero. Alpha is already part of vanilla QOI (unlike VP8). Animation doesn't seem necessary but we could add it (copying WebP/VP8X) if we wanted to.

This "QOIX" chunk has reserved-zero bits (the high 24 bits of the u32le after "QOIX") that can act as version number. The low 2 bits could indicate the presence of "COMP" and "TILE" chunks.

TBD: After that is an optional "COMP" chunk, for configuring custom compression schemes (e.g. xz, zstd).

TBD: After that is an optional "TILE" chunk, for cutting a larger image into uniform-sized tiles (and maybe specifying a background color for missing tiles).

After that is an optional "ICCP" chunk, again just like WebP/VP8X.

After that are optional "pre-QOIT" chunks.

After that comes one or more "QOIT" chunks: QOI tiles.

After that are optional "post-QOIT" chunks.

After that is an optional "EXIF" chunk, again just like WebP/VP8X.

After that is an optional "XMP " chunk, again just like WebP/VP8X.

QOIT

Each QOI tile chunk is:

  • a u32le Chunk FourCC ("QOIT"),
  • a u32le Chunk Size and
  • a variable sized Chunk Payload (padded to an even number of bytes, like all RIFF chunks are).

That Payload starts with a u32le Flags field. From LSB to MSB:

  • 2 bits Transparency:
    • 0: no alpha (equivalent to "3 channel BGR / RGB" or "4 channel BGRX / RGBX")
    • 1: non-premul alpha
    • 2: reserved
    • 3: premul alpha
  • 1 bit TileSubrectangle.
  • 2 bits Compression:
    • 0: no compression
    • 1: LZ4 block compression
    • 2: reserved
    • 3: Custom compression
  • 3 bits Lossiness (bit depth). Zero means lossless. Non-zero means lossy. For example, a value of 2 means that the 8-bit pixel values were right-shifted by 2 prior to encoding. Decoding undoes that (after executing all of the QOI bytecode) as best it can: each uint8 pixel value p is replaced by (((p&0x3F) << 2) | ((p&0x3F) >> 4)).
  • 24 bits reserved.
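
The lossiness round-trip for a shift of 2 can be sketched as below. The function names are hypothetical; the reconstruction expression is the one given in the Flags description:

```c
#include <assert.h>
#include <stdint.h>

// Lossy pre-encode: right-shift pixel values by `shift` (2 in the example).
static uint8_t lossy_quantize(uint8_t p, unsigned shift) {
    return (uint8_t)(p >> shift);
}

// Post-decode reconstruction for shift = 2: replicate the surviving high
// bits into the vacated low bits, as given above.
static uint8_t lossy_reconstruct2(uint8_t p) {
    return (uint8_t)(((p & 0x3F) << 2) | ((p & 0x3F) >> 4));
}
```

Note that 0x00 and 0xFF both round-trip exactly, which matters for fully transparent and fully opaque alpha.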

After the 4 byte Flags:

  • If the TileSubrectangle bit was set then 4 u32le values x0, y0, x1, y1 that define the top-left (inclusive) and bottom-right (exclusive) of this tile. TBD: some of this (e.g. if tiles have uniform width and height) could be factored out into a global "TILE" chunk.
  • If the Custom compression bits were set then a u32le Compression Configuration Size value and then CCS bytes to define the compression codec. For example CCS=4 may be followed by the 4 bytes "zstd" for Zstandard with no further configuration. Qoiriffic decoders must support LZ4 block compression (given in the header bit) but aren't required to support any custom compression codecs (the decoding fails as if it was an unsupported pre-QOIT chunk). TBD: again, some of this could be factored out into a global "COMP" chunk.

After that, vanilla QOI (i.e. bytecode) without the 14-byte header (but with the 8-byte padding trailer).

Pre-QOIT and Post-QOIT Chunks

These are extensions - arbitrary chunks that not all decoders are required to support. But if you control all of the producer and consumer implementations (e.g. a video game's first party assets), feel free to put your custom extensions here.

Being before or after the QOIT chunks corresponds to being a critical or ancillary chunk in the PNG spec. Unsupported pre-QOIT chunks (e.g. some sort of Hilbert curve pixel traversal configuration #6 or subtract-green or palette transform) means that the overall decoding fails but unsupported post-QOIT chunks (e.g. some sort of thumbnail or modification-time representation) can be ignored.

This document does not define any extensions.

Analyzing images

I tried to analyze both "images" and "images-lance" to see what operation works best for a given number of bits using two different prediction methods.

 B
Ax

A       : take the difference between the current pixel x and the pixel A to its left.
(A+B)/2 : take the difference between the current pixel x and the rounded average of the pixel to its left (A) and the pixel above it (B).

To make things easier I discarded the top row and left column of every image so pixels A and B are always available.
I guess discarding that little information shouldn't have a significant impact on the results.
I also discarded all pixels equal to the one at the left of the current pixel since a run operation takes care of those.
For the remaining pixels I counted what percentage of those pixels would be covered by a specific operation if it were the only operation that would be used.
The alpha channel was ignored.
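
A minimal sketch of the two predictors per channel, assuming the average rounds up and the encoded residual wraps modulo 256 (the names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

// Prediction 1: the pixel to the left (A).
static uint8_t predict_left(uint8_t a, uint8_t b) {
    (void)b;
    return a;
}

// Prediction 2: the average of left (A) and above (B), rounded up here
// (the exact rounding direction is an assumption).
static uint8_t predict_avg(uint8_t a, uint8_t b) {
    return (uint8_t)((a + b + 1) / 2);
}

// The residual that a diff/luma-style op would encode, with
// wrap-around uint8 arithmetic.
static uint8_t residual(uint8_t actual, uint8_t predicted) {
    return (uint8_t)(actual - predicted);
}
```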

My conclusion is that, for any given number of bits, luma always worked better than diff for these specific image collections, and in general the second prediction method works better than the first.
Analyzing the pictures on my computer (mainly images taken with a digital camera) confirms this conclusion.

See the attached files for the data as a .csv file.
Column 1 : total bits used
Column 2 : prediction method
Column 3 : operation that was used
Column 4 : percentage covered

Maybe this information helps to decide which operations to select.

images-lance.csv.txt
images.csv.txt

Demo 10: a Proof of Concept

Here are some pretty sweet throughput numbers on my laptop (with similar compression sizes) for a proof of concept, combining little-endianness (#3) with 7-bit indexes (#20). Little-endianness in particular allows us to remove a lot of branching in the decoder's inner loops.

I can't think of a good name for it right now, so I'm just going to call it "Demo 10".

        decode ms   encode ms   decode mpps   encode mpps   size kb
images/kodak
qoi-master:   6.4         8.4         61.15         47.01       771
qoi-experi:   6.9         9.4         57.36         41.91       700
qoi-demo10:   4.0         6.8         98.25         58.03       772
demo10/master ratio:                   1.61x         1.23x
images/misc
qoi-master:   5.7         6.0        154.59        147.09       400
qoi-experi:   6.0         6.6        148.95        134.45       407
qoi-demo10:   2.7         4.3        327.75        204.86       400
demo10/master ratio:                   2.12x         1.39x
images/screenshots
qoi-master:  54.4        48.8        151.33        168.53      2582
qoi-experi:  53.9        51.3        152.61        160.47      2491
qoi-demo10:  28.1        33.6        292.57        244.87      2481
demo10/master ratio:                   1.93x         1.45x
images/textures
qoi-master:   1.8         2.1         74.17         62.86       184
qoi-experi:   1.7         2.3         75.03         56.17       179
qoi-demo10:   0.9         1.5        138.48         85.30       180
demo10/master ratio:                   1.87x         1.36x
images/wallpaper
qoi-master: 138.5       154.1         67.69         60.82     10640
qoi-experi: 143.3       155.7         65.38         60.19     10170
qoi-demo10: 101.1       124.1         92.70         75.50     10669
demo10/master ratio:                   1.37x         1.24x

Details:

  • qoi-master is the phoboslab/master branch, commit e9069e1.
  • qoi-experi is the phoboslab/experimental branch, commit cbb62ea.
  • qoi-demo10 is the little-endian index-7 "demo 10". Source code is in qoi-demo10.h.txt (it replaces qoi.h).
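
For illustration, this is the kind of branch reduction little-endianness enables: load 4 bytes once as a u32le and extract fields with shifts and masks, instead of conditionally reading one byte at a time. The field layout below is purely hypothetical, not Demo 10's actual bitstream:

```c
#include <assert.h>
#include <stdint.h>

static uint32_t load_u32le(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

// Hypothetical layout: a 2-bit opcode in the low bits, then three packed
// 8-bit fields. One load plus shifts replaces per-byte conditional reads.
static void unpack_fields(uint32_t v, uint32_t *op,
                          uint32_t *f0, uint32_t *f1, uint32_t *f2) {
    *op = v & 0x3;
    *f0 = (v >> 2) & 0xFF;
    *f1 = (v >> 10) & 0xFF;
    *f2 = (v >> 18) & 0xFF;
}
```

After decoding, the cursor advances by the op's known length, so speculative over-reading is safe as long as the trailer padding is present.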

License concern with creating a new format

What restrictions might we be subject to by creating a new file format that is based on QOI and uses the same encoding scheme?
Should we rename the format (not just append a '2')?
How much would we have to change the QOI specification to be free of the original license?

Demo 20: a Proof of Concept

Starting from commit aefa0f7 (2021-12-17) from the phoboslab/master branch, qoi-op-a adds a QOI_OP_A (like the old QOI_OP_COLOR(A)), replacing "QOI_OP_RUN with run length 62".

qoi-demo20 builds on that by:

  • using a FIFO color cache (instead of a hashed color cache), like qoi-fifo2e (#19 (comment)), except that we get all of the encoding speed back by re-introducing hashing just for the encoder. Different encoder implementations can bikeshed on the particular hash function as much as they want, but hashing is kept out of the file format specification (simpler) and out of the decoder implementation (faster). No SIMD required.
  • the color cache's initial state was tweaked to include opaque black (consistent with the px_prev initial value, a la phoboslab/qoi#75 (comment)). "QOI_OP_RUN with run length 1" is now redundant (there's always an equivalent QOI_INDEX op), and while it's not in qoi-demo20, we could re-assign the opcode (or bump run lengths by 1).
  • various optimizations like qoi-demo2e (phoboslab/qoi#71 (comment)), an optimized version (with no file format changes) of phoboslab/master commit 2ee2169.
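
A sketch of the FIFO-cache-plus-encoder-hash idea, assuming a 64-entry cache and an arbitrary hash (both illustrative, neither is from the actual demo): the FIFO array is all the format and decoder need, while the encoder keeps a private hash-to-slot map for O(1) lookup.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CACHE_SIZE 64  // assumed size

typedef struct { uint8_t r, g, b, a; } rgba_t;

typedef struct {
    rgba_t colors[CACHE_SIZE];  // the FIFO itself: all a decoder needs
    int next;                   // next slot to overwrite (wraps around)
    int pos_by_hash[256];       // encoder-only: last slot per hash bucket
} fifo_cache_t;

// Any hash works here; it is deliberately not part of the file format.
static uint8_t color_hash(rgba_t c) {
    return (uint8_t)(c.r * 3 + c.g * 5 + c.b * 7 + c.a * 11);
}

// Both encoder and decoder push colors in the same FIFO order.
static int cache_push(fifo_cache_t *fc, rgba_t c) {
    int slot = fc->next;
    fc->colors[slot] = c;
    fc->pos_by_hash[color_hash(c)] = slot;  // encoder bookkeeping only
    fc->next = (fc->next + 1) % CACHE_SIZE;
    return slot;
}

// Encoder lookup: hash to a candidate slot, then verify it.
static int cache_find(const fifo_cache_t *fc, rgba_t c) {
    int slot = fc->pos_by_hash[color_hash(c)];
    const rgba_t *e = &fc->colors[slot];
    if (e->r == c.r && e->g == c.g && e->b == c.b && e->a == c.a) {
        return slot;
    }
    return -1;  // fall back to a diff/luma/literal op
}
```

Because the hash is only an encoder-side accelerator, different encoders can pick different hashes without any decoder or format change.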

The images test suite is the 'official' one from phoboslab/qoi#69 and images/icon_512 within that is highlighted for decoder/encoder speed gains on top of phoboslab/master. The images-lance test suite comes from phoboslab/qoi#69 (comment) and shows 1.13x compression ratio gains on top of phoboslab/master.

        decode ms   encode ms   decode mpps   encode mpps   size kb    rate

## Total for images/icon_512
qoi-demo2e:   0.9         1.4        300.26        183.00        85    8.4%
qoi-aefa0f7:  1.6         1.6        165.18        167.23        85    8.4%
qoi-op-a:     1.4         1.2        180.92        213.11        79    7.8%
qoi-demo20:   0.8         1.2        322.69        227.86        79    7.8%

# Grand total for images
qoi-demo2e:   3.8         5.3        121.20         87.21       463   25.6%
qoi-aefa0f7:  4.8         5.6         97.28         83.08       463   25.6%
qoi-op-a:     4.7         4.8         98.25         96.80       462   25.5%
qoi-demo20:   3.6         4.9        127.71         95.65       459   25.4%

# Grand total for images-lance
qoi-demo2e:  26.9        30.8        151.25        131.97      2109   13.3%
qoi-aefa0f7: 35.4        32.4        114.61        125.29      2109   13.3%
qoi-op-a:    35.8        29.1        113.38        139.46      1862   11.7%
qoi-demo20:  27.1        28.2        150.12        143.97      1859   11.7%

qoi-demo20.h.txt

Hash function

Currently, it's:

#define QOI_COLOR_HASH(C) (C.rgba.r ^ C.rgba.g ^ C.rgba.b ^ C.rgba.a)
// followed by an "&63" which drops the high 2 bits of each channel.

This is about as simple as it gets, which has its own virtues. There may be better ones, although be aware that better compression (size) might mean worse performance (throughput).

Some discussion (amongst other things) is at phoboslab/qoi#48

The hash function can be improved

The hash function is pretty bad.

Try it on this image:
BA1959A1
Six of the seven colors have the same hash value. The colors are all cyan, and from a 12BPP RGB cube.

For R=0, G=B=16*N (N integer), and A=255, the hash is always 53.
The reason for this is that the sum of the hash multipliers for G and B (5 and 7 respectively), multiplied by 16 is 192, which is an integer multiple of the table size (64).
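
The collision can be checked directly with QOI's multipliers (3, 5, 7, 11 for R, G, B, A) and the 64-entry table:

```c
#include <assert.h>

// The hash from the QOI specification: position in a 64-entry table.
static int qoi_hash64(int r, int g, int b, int a) {
    return (r * 3 + g * 5 + b * 7 + a * 11) % 64;
}
```

For R=0, A=255 the hash is (5+7)*16*N + 11*255 = 192*N + 2805, and since 192 and 2805 mod 64 are 0 and 53, every such color maps to bucket 53.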

I reduced the table size to 63 (which is relatively prime to 256 but not to the prime multipliers), and the compression ratio immediately improved by 6.95%.

I changed multipliers to [5, 11, 13, 17] which are relatively prime to 63 (also taking into account their sums), and the ratio improvement is now 6.36%

🤔

I think it's random noise at this point because just reordering the multipliers changes the ratio by even larger amounts.

Anyway, my point stands, which is that a hash table of 64 is bad.

[edit] curiously this big improvement only happens when using signed rgba values. When using unsigned, the ratios are 3.96% and 4.78%, respectively (new multipliers being better). Note that with a hash table size of 64, signed and unsigned calculations both generated identical hash values. The signed version is trickier however, because in most languages, % (remainder) gives a negative result when the dividend is negative (i.e. truncMod). What's needed here is a true modulo operator (floorMod).
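
In C, for example, a floor modulo has to be built on top of the truncating % operator:

```c
#include <assert.h>

// C's % truncates toward zero, so a negative dividend gives a negative
// result (truncMod). A true floor modulo always lands in [0, m) for m > 0.
static int floor_mod(int x, int m) {
    int r = x % m;
    return (r < 0) ? r + m : r;
}
```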

But wait, we can do much better: #39

More benchmark suite image variety (esp. re alpha)

Per #14

The test images are linked to from https://phoboslab.org/files/qoibench/ (with misc/lenna.png removed).

and

x               kodak    misc     screenshots  textures  wallpaper  total average
QOI_COLOR (40)  0.000%   9.986%   0.000%       0.000%    0.000%     1.997%

It's not surprising in retrospect that, other than misc, nothing uses 5-byte QOI_COLOR codes, as the source pixels are fully opaque: they're photographs and screenshots.

Should we source some more PNG-images-with-alpha for our test suite? Perhaps some logos or SVG-to-PNG conversions??

Similar but not exactly the same:
phoboslab/qoi#20

Zstandard Compression

I imagine I'm not the first one to look into this, but I haven't seen another post about this yet so I may as well throw the idea onto the playing field. Apologies if this is the wrong place for this 😅

On screenshots, I got anywhere from 15% to sometimes 50% size reduction with zstd levels 1-3, generally outperforming even fairly well-optimized PNGs in compression ratio.
For other image types it seems to be around 10-15%.

Here's a more thorough comparison of the en.wikipedia.org.png screenshot from the original "test bench":

Compression      Size       Ratio
original file    1,316,655  1.000
oxipng -D -o3    1,046,134  0.794
qoi              1,498,761  1.138
qoi + zstd -1      955,221  0.725
qoi + zstd -3      809,694  0.615
Tested with the master branch of QOI (2ee2169), zstd v1.5.0, and oxipng 5.0.1.

Haven't done a complete comparison across the entire test bench yet; if there's interest I might get around to it though.

Pixel format wishlist

It would be useful if the pixel formats accepted could expand beyond 8 bit RGB/RGBA 4:4:4, but people have different opinions about where to expand and how to do it.

This guy mentions 10/12/14/16 4:4:4 RGB as useful, which a single 16 bit implementation may be able to usefully cover without undue complication? phoboslab/qoi#95 (comment)

I like the idea of supporting 8/10 bit YCbCr 4:4:4/4:2:2/4:2:0, that way the bitstream could be used as the basis for a video codec without heavy modifications. As a streaming pixel engine we don't necessarily have the luxury of being able to jump back to handle subsampling properly. Input could be processed two rows at a time, allowing 4x2 blocks to be used, or enforce the IO to be re-arranged into a 4x2 block streaming form (ie YYYYYYYYCbCbCrCr or similar for 4:2:0). It looks like YUV format allows for many memory layouts which sounds like a complication, but takes the ordering optimisation out of our hands. It means we can get away with enforcing 4x2 block streaming IO, ie any other YUV form is handled by an intermediary handler piping between file and encoder/decoder.

How to actually compress YCbCr/subsampled data is an open question. Single-pixel diff encoding doesn't seem to do much, and if we do read in blocks it seems wasteful not to utilise that.

Some improvements to QOI_DIFF_16, QOI_DIFF_24 and QOI_COLOR

This is related to #14

Not sure, but maybe we can try to optimize the use of DIFF_16 and DIFF_24 by using an opcode that would store 2 color differences in 24 bits? I believe this would be more effective in terms of memory.

3 bits for opcode
554 rgb diff
544 rgb diff

Basically the same precision as in DIFF_16, but in 12 bits per color.
(upd: I was wrong, 443 and 433 is more realistic)

I've also been thinking about opcodes that could store the index diff with a color diff, but I'm not sure how efficient that would be..
Also maybe we can improve the color difference range by shifting the last bit to the left, so that we could store the ranges 0-15 and 32-47 instead of 0-31 in 5 bits. (Although we'd need another opcode with the 0-31 range.)
I was also collecting some statistics when I was working with QOI, and noticed that most of the QOI_COLOR opcodes are ~4 bytes long. But to be honest, I don't remember which collection of images I checked it on, so it needs to be rechecked. But if these are valid results, then maybe we can somehow optimize the use of QOI_COLOR? Maybe instead of choosing between saving and not saving the value of an 8 bit channel, we can choose between a 4 bit difference and an 8 bit value? Although this solution has serious padding issues, for example, if we had rgb in 4 bit mode with alpha in 8 bit mode, the opcode would be 28 bits long... Therefore, instead of storing 4 bits with information for each channel, some kind of enumeration should be used.

(1 variation with 4 8-bit vals, 1 variation with 3 8-bit vals, 6 variations with 2 4-bit diffs and 2 8-bit vals, 3 variations with 2 4-bit diffs and 1 8-bit value, 4 variations with 1 8-bit value)
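
A sketch of packing two diffs into 24 bits using the updated 443/433 split: a 3-bit opcode, an 11-bit diff (4+4+3) and a 10-bit diff (4+3+3). The opcode value, bit order and biases here are all hypothetical:

```c
#include <assert.h>
#include <stdint.h>

// Pack two per-channel color diffs into one 24-bit code. Signed diffs are
// stored with a bias so they fit in unsigned bit fields:
// 4-bit fields hold -8..7, 3-bit fields hold -4..3.
static uint32_t pack_two_diffs(int r0, int g0, int b0,
                               int r1, int g1, int b1) {
    uint32_t v = 0x5;                // assumed 3-bit opcode
    v |= (uint32_t)(r0 + 8) << 3;    // 4 bits
    v |= (uint32_t)(g0 + 8) << 7;    // 4 bits
    v |= (uint32_t)(b0 + 4) << 11;   // 3 bits
    v |= (uint32_t)(r1 + 8) << 14;   // 4 bits
    v |= (uint32_t)(g1 + 4) << 18;   // 3 bits
    v |= (uint32_t)(b1 + 4) << 21;   // 3 bits
    return v;                        // fits in 24 bits
}
```

Two pixels in 3 bytes is 12 bits per pixel, versus 16 each for two DIFF_16 codes, at the cost of narrower per-channel ranges.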

Change QOI_COLOR to QOI_DIFF_COLOR

The decoder currently says:

if ((b1 & QOI_MASK_4) == QOI_COLOR) {
  if (b1 & 8) { px.rgba.r = bytes[p++]; }
  if (b1 & 4) { px.rgba.g = bytes[p++]; }
  if (b1 & 2) { px.rgba.b = bytes[p++]; }
  if (b1 & 1) { px.rgba.a = bytes[p++]; }
}

The proposal is to change the = to +=, consistent with the other QOI_DIFF_ETC codes. There's no change in expressiveness (with uint8_t modular arithmetic). The encoder changes correspondingly, obviously.

The potential benefit is that the four opcodes QOI_DIFF_{8,16,24,COLOR} could become one single if-else-branch instead of four (especially when combined with #3), where the shifts and masks for each channel come from a (branch free) look-up table instead of if-else chains. This might have a performance impact. It might not.
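
A sketch of the table-driven idea: per-channel shift/bit-count pairs drive one generic update with the += semantics proposed above, instead of a per-opcode if-else chain. The layout and names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

// One entry per channel per (hypothetical) unified diff op: where the
// channel's bits live in the op's little-endian argument word.
typedef struct { uint8_t shift, bits; } field_t;

// Apply one channel's diff: extract the field, subtract its bias, and
// add to the pixel value with modular uint8 arithmetic (+= semantics).
static uint8_t apply_diff(uint8_t px, uint32_t arg, field_t f, int bias) {
    uint32_t d = (arg >> f.shift) & ((1u << f.bits) - 1);
    return (uint8_t)(px + (int)d - bias);
}
```

A zero-width field (bits = 0, bias = 0) naturally encodes "channel unchanged", which is what lets one loop cover all four ops.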

QOI-plane, a 1 and 2 channels (grey and grey + alpha) codec like QOI

This "QOI-plane" codec breaks byte alignment and works in groups of 4-bits in order to encode greyscale.
https://github.com/AuburnSounds/gamut/blob/main/source/gamut/codecs/qoiplane.d#L79

Typically it encodes a uint8 image with 4.55 bit / pixel, and a luminance + alpha image with lower number of bits, but I fear my test suite is too easy with alpha + grey images. Didn't find the opcode room to have a FIFO.

So it beats a qoiavg2 that would instead encode a greyscale input as rgb color, which leads to more than 8 bits per pixel.
All these numbers are after a LZ4, I haven't tried a naked QOI-plane.

/// Encoding:
///
/// QOIPLANE_DIFF1     0xxx                          => diff -4..+3 vs average of rounded up left pixel and top pixel
/// QOIPLANE_DIFF2     100x xxxx                     => diff -16..15 vs average of rounded up left pixel and top pixel
/// QOIPLANE_ADIFF     1011 xxxx                     => diff -7..+7 in alpha channel
/// QOIPLANE_LA        1011 0000 xxxx xxxx aaaa aaaa => encode direct full values
/// QOIPLANE_DIRECT    1010 xxxx xxxx                => direct value
///                                                   If channels == 2 and the last opcode is not a QOIPLANE_ADIFF
///                                                   then QOIPLANE_DIRECT encodes an alpha value.
/// QOIPLANE_REPEAT1   11xx                          => repeat 1 to 3 times the last pixel
/// QOIPLANE_REPEAT2   1111 xxxx xxxx                => repeat 4 to 258 times a pixel.
///                                                     (1111 1111 1111 disallowed, indicates end of stream)

Huffman trees are an easy way to outperform PNG

Where QOI performs well, adding Huffman trees makes it outperform even libpng. The bad news: It makes encoding the image a 2-pass operation (decoding remains 1-pass), so you either have to store the whole result in memory, or encode the raw QOI image twice (once to get frequencies per input byte, and once more to do the actual encoding).

Did some experiments with 2+n Huffman trees, where n is the number of channels in the image (so, 5 or 6 trees).

The format I'm doing here replaces one byte in the header ("hoif" instead of "qoif" - H being short for "Huffman" but also so one could call it "Hardly OK Image Format"), adds 1 extra byte of header after the usual QOI header (I might want to extend this if I take this seriously), then the rest of the image is compressed as if it's a full command stream (including the 8-byte footer).

Note that my prototype is written entirely in Python, so I cannot give a fair performance analysis (it can take a few seconds to compress a large image).

Consider the command list here:

# QOI_OP_INDEX 00iiiiii
# QOI_OP_DIFF  01rrggbb
# QOI_OP_LUMA  10gggggg rrrrbbbb
# QOI_OP_RGB   11111110 rrrrrrrr gggggggg bbbbbbbb
# QOI_OP_RGBA  11111111 rrrrrrrr gggggggg bbbbbbbb aaaaaaaa
# QOI_OP_RUN   11nnnnnn
  • The first byte uses the command tree.
  • If we are using QOI_OP_LUMA, the second byte goes to the luma byte tree.
  • If we are using QOI_OP_(RGB|RGBA), then each byte goes to the tree for the respective channel in the per-channel trees.

Currently, the trees are placed in this order, as 256 bytes where each byte indicates the bit length for that input byte for a canonical Huffman tree:

  • command tree
  • luma byte tree
  • red tree
  • green tree
  • blue tree
  • alpha tree (only if RGBA)
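
For reference, assigning codes from such a 256-byte bit-length table is the standard canonical Huffman construction (a common sketch, not necessarily the prototype's exact code): codes of equal length are numbered in symbol order, and each length continues from the previous length's last code, shifted left.

```c
#include <assert.h>
#include <stdint.h>

// Build canonical Huffman codes from per-symbol bit lengths
// (length 0 = symbol unused). codes[s] holds the code for symbol s,
// to be emitted MSB-first using lengths[s] bits.
static void canonical_codes(const uint8_t lengths[], int n,
                            uint16_t codes[]) {
    uint32_t code = 0;
    for (int len = 1; len <= 15; len++) {
        for (int s = 0; s < n; s++) {
            if (lengths[s] == len) {
                codes[s] = (uint16_t)code++;
            }
        }
        code <<= 1;
    }
}
```

Because the table of lengths fully determines the codes, the header never needs to store the codes themselves.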

From qoi_test_images.zip, we get these results for photos:

 652383  kodim10.qoi
 593463  kodim10.png
 500083  kodim10.hoi

 675251  kodim23.qoi
 557596  kodim23.png
 519130  kodim23.hoi

1521134  wikipedia_008.qoi
1344960  wikipedia_008.png
1291391  wikipedia_008.hoi

As for RGBA... well, it doesn't quite get to PNG levels (QOI doesn't fare well when the alpha channel isn't mostly 0x00 and 0xFF) but it does still improve the ratio:

 519653  dice.qoi
 349827  dice.png
 411231  dice.hoi

Index Cache Eviction Strategy

It's possible the default eviction strategy of "always evict" could be improved upon, here's some preliminary data. Based on luma61.run64, which is a tweaked version of the latest experimental branch. It doesn't matter much what the build is as the point is to compare between strategies.

qoi-l61r64: Vanilla for comparison
qoi-l61r64c*: A cache strategy described below

//Increment a value whenever an index is used
//When we try to evict an index we check this value. On zero we evict,
//otherwise the value is reduced depending on strategy:
// 0: vanilla, always evict
// 1: After failing to evict, set to zero
// 2: After failing to evict, decrement
// 3: After failing to evict, divide by 2
// 4: After failing to evict, divide by 4
// 5: After failing to evict, divide by 8
// 6: After failing to evict, divide by 16
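
The check described in the comment above can be sketched for strategy 3 as follows (a hypothetical helper, not the exact benchmark code):

```c
#include <assert.h>
#include <stdint.h>

// hits[slot] counts how often that index slot was reused since it was
// last written. On an eviction attempt: evict only when the count is
// zero; otherwise reduce it per the strategy (here: divide by 2).
static int try_evict(uint8_t hits[], int slot) {
    if (hits[slot] == 0) {
        return 1;        // evict: overwrite this slot
    }
    hits[slot] /= 2;     // strategy 3; the other strategies differ here
    return 0;            // keep the old color this time
}
```

Note both encoder and decoder must run the identical check, which is exactly why the extra per-pixel bookkeeping shows up in the decode timings below.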
## Benchmarking ../qoitests/images/tango512/*.png -- 1 runs
## Totals (AVG) size: 0x0
               decode ms   encode ms   decode mpps   encode mpps   size kb
libpng:              3.1        20.5         85.13         12.78        51
stbi:                2.1        20.5        124.91         12.77        69
qoi-l61r64:          0.6         0.8        476.06        328.86        86
qoi-l61r64c1:        0.7         0.9        366.43        304.27        83
qoi-l61r64c2:        0.7         0.9        358.94        302.28        86
qoi-l61r64c3:        0.7         0.9        359.54        307.98        83
qoi-l61r64c4:        0.7         0.8        361.45        310.28        83
qoi-l61r64c5:        0.7         0.8        362.30        308.45        83
qoi-l61r64c6:        0.7         0.8        363.78        310.00        83

## Benchmarking ../qoitests/images/kodak/*.png -- 1 runs
## Totals (AVG) size: 0x0
               decode ms   encode ms   decode mpps   encode mpps   size kb
libpng:              9.7       180.1         40.61          2.18       717
stbi:               10.2        99.9         38.71          3.94       979
qoi-l61r64:          2.6         4.4        153.22         89.33       675
qoi-l61r64c1:        3.5         5.1        111.14         77.45       678
qoi-l61r64c2:        3.8         5.0        102.18         78.11       680
qoi-l61r64c3:        3.7         4.8        105.19         82.21       679
qoi-l61r64c4:        3.5         4.7        110.82         84.04       678
qoi-l61r64c5:        3.6         4.8        108.33         82.41       678
qoi-l61r64c6:        3.5         4.7        111.40         83.19       678

## Benchmarking ../qoitests/images/misc/*.png -- 1 runs
## Totals (AVG) size: 0x0
               decode ms   encode ms   decode mpps   encode mpps   size kb
libpng:             12.0       102.9         73.98          8.64       283
stbi:               10.6        97.6         84.11          9.11       415
qoi-l61r64:          2.1         3.2        417.70        278.41       503
qoi-l61r64c1:        2.8         3.4        319.73        259.87       500
qoi-l61r64c2:        2.9         3.5        309.54        251.76       495
qoi-l61r64c3:        3.0         3.5        298.79        254.31       498
qoi-l61r64c4:        2.9         3.4        311.53        260.90       499
qoi-l61r64c5:        2.8         3.4        318.56        259.64       500
qoi-l61r64c6:        2.9         3.4        311.75        258.41       500

## Benchmarking ../qoitests/images/textures/*.png -- 1 runs
## Totals (AVG) size: 0x0
               decode ms   encode ms   decode mpps   encode mpps   size kb
libpng:              3.1        43.3         41.76          3.00       163
stbi:                2.8        24.2         46.19          5.37       232
qoi-l61r64:          0.7         1.1        199.69        115.21       179
qoi-l61r64c1:        0.8         1.1        160.15        114.00       177
qoi-l61r64c2:        0.8         1.2        156.88        111.28       179
qoi-l61r64c3:        0.8         1.1        158.34        116.67       177
qoi-l61r64c4:        0.9         1.1        152.24        113.22       177
qoi-l61r64c5:        0.9         1.2        146.05        107.64       177
qoi-l61r64c6:        0.8         1.2        156.20        112.66       177

## Benchmarking ../qoitests/images/screenshots/*.png -- 1 runs
## Totals (AVG) size: 0x0
               decode ms   encode ms   decode mpps   encode mpps   size kb
libpng:             57.8       622.7        142.49         13.22      2219
stbi:               50.7       748.1        162.22         11.00      2821
qoi-l61r64:         20.6        26.6        400.49        309.33      2475
qoi-l61r64c1:       26.9        27.6        305.70        297.89      2455
qoi-l61r64c2:       27.8        28.8        295.87        285.81      2467
qoi-l61r64c3:       27.3        27.9        301.21        295.01      2451
qoi-l61r64c4:       27.8        27.4        296.28        300.17      2454
qoi-l61r64c5:       26.9        27.5        305.59        299.68      2455
qoi-l61r64c6:       26.8        27.4        307.46        300.61      2455

## Benchmarking ../qoitests/images/wallpaper/*.png -- 1 runs
## Totals (AVG) size: 0x0
               decode ms   encode ms   decode mpps   encode mpps   size kb
libpng:            189.9      2893.7         49.35          3.24      9224
stbi:              221.1      1786.6         42.40          5.25     13299
qoi-l61r64:         64.3        85.4        145.69        109.80      9964
qoi-l61r64c1:       83.0        93.8        112.97         99.91     10021
qoi-l61r64c2:       89.1       100.0        105.23         93.76     10179
qoi-l61r64c3:       87.1        94.2        107.55         99.47     10079
qoi-l61r64c4:       84.8        92.8        110.48        101.01     10039
qoi-l61r64c5:       83.4        91.5        112.32        102.46     10027
qoi-l61r64c6:       82.9        91.5        112.99        102.43     10023

Performance tanks hard so even if it was a decent gain in compression (it isn't) it still wouldn't be viable. It's hard to see how a strategy can be implemented with less performance impact but maybe someone smarter can figure something out. It's most likely a dead end.

qoi-cache.h.txt
