OK, so here is the math!
Each build uncompressed is ≈28 MB. This does not include unnecessary source files or anything, so it does not look like we can go lower than that (unless we use compression).
So how many builds do we want to keep? Well, one year from now back is about 3000 commits, which gives roughly 84 GB. And this is just one year of builds for just the MoarVM backend. In about 10 years we will slowly start approaching the 1 TB mark. Multiply that by the number of backends we want to track.
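The arithmetic above is easy to sanity-check (counting 1 GB as 1000 MB):

```shell
# Back-of-the-envelope check of the storage estimate above.
per_build_mb=28
commits_per_year=3000
echo "$((per_build_mb * commits_per_year / 1000)) GB per year"        # 84 GB
echo "$((per_build_mb * commits_per_year * 10 / 1000)) GB per decade" # 840 GB
```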
Is it a lot? Well, no, but for folks who have an SSD (like me) this might be a problem.
Given that people commit stuff at a slightly faster pace than storage becomes significantly cheaper, I think that we should compress it anyway (even if it is moved to a server with more space). It is a good idea in the long run. And it will make it easier for us to throw in some extra stuff (JVM builds, or maybe even 32-bit builds or something? You never know).
OK, so what can we do?
Filesystem compression
The most obvious option is to use compression in btrfs. The problem is that it is applied to each file individually, so we are not going to save anything across many builds. Also, it is only viable if you already have btrfs, so it does not look like the best option.
Compress each build individually
While it may sound like a great idea to compress all builds together, it does not work that well in practice. Well, it does, but keep reading.
The best compression I got is with 7z. Each build is ≈4 MB (≈28 MB uncompressed, therefore ≈7× space saving!)
Compressing each build individually is also good for things like local bisect. That is, we can make these archives publicly available, and then write a script that pulls them in for your local git bisect. How cool is that! That would be about 40 MB of stuff to download per git bisect, and you cannot really compress it any further anyway, because you don't know ahead of time which files you will need.
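Such a script could look roughly like this. The base URL and the archive layout (one 7z archive per commit, named by its SHA) are assumptions here, since nothing is published yet:

```shell
# Hypothetical helper for local bisect: map the commit being tested to its
# prebuilt archive, fetch it, and unpack it. BASE_URL is made up.
BASE_URL=https://example.org/rakudo-builds

build_url() {
    # One 7z archive per commit, named by its SHA.
    echo "$BASE_URL/$1.7z"
}

fetch_build() {
    # Download and unpack the build for a given SHA into /tmp.
    sha=$1
    curl -fsSL "$(build_url "$sha")" -o "/tmp/$sha.7z"
    7z x -y -o"/tmp/$sha" "/tmp/$sha.7z"
}

# During "git bisect run ./test.sh", test.sh would then do something like:
#   sha=$(git rev-parse HEAD)
#   fetch_build "$sha"
#   "/tmp/$sha/bin/perl6" -e '<the code being bisected>'
```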
This gives us ≈120 GB per 10 years. Good enough, I like it.
Is there anything that performs better than 7z? Well, yes:
27M c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar
9.0M c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.lz4
7.0M c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.zip
6.7M c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.gz
6.7M c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.lzfse
6.4M c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.bz2
5.7M c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7-L15.tar.zst
4.9M c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7-L22.tar.zst
4.7M c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.lzham
4.5M c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.brotli
4.3M c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.lrz
4.1M c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.7z
4.1M c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tpxz
4.0M c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.xz
3.8M c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7-extremely-slow.tar.lrz
xz is much slower during compression and a tiny bit slower when decompressing, so the win is insignificant. lrz with extreme options is much slower at everything, so forget it.
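For reference, each row in the list above comes from archiving one unpacked build and running it through a compressor. gzip stands in below because it is available everywhere; the other rows are produced analogously (e.g. `7z a $sha.7z $dir` or `tar cf - $dir | xz -9 > $sha.tar.xz`):

```shell
# Reproduce one row of the comparison with a stand-in build directory.
dir=$(mktemp -d)
head -c 200000 /dev/urandom > "$dir/binary"        # incompressible part
yes 'line of text' | head -c 200000 > "$dir/text"  # compressible part

tar -C "$dir" -cf build.tar .          # the uncompressed baseline
gzip -9 -c build.tar > build.tar.gz    # one compressor; swap in 7z/xz/zstd/...
ls -l build.tar build.tar.gz           # compare the sizes
```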
Let's compress everything together!
For the tests, I took only 7 builds:
c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7
3878066a953195276ef99739f157d38793395d06
ef04e1e07f1de0d4eb2666985c7290f96c912be6
eea786e2a2febef0ab0bfabca956beae95ab81fd
76be77c9d6e697c26e92dc704109b7b8780845aa
d30806bd00dee201cd891550ac65f084f18e8285
9bfbab9186d710e0603b1eb86be1e5ba2e0c84d1
Now, there are some funny entries here. Obviously, .tar is the size of all builds together, uncompressed. git-repo is a git repo with every build committed one after another in the right order. Look, it performs better than some of the other things! Amazing! And even better if you compress it afterwards. Who would've thought, right?
187M all.tar
63M all.tar.lz4
49M all.zip
47M all.tar.gz
45M all.tar.bz2
42M git-repo.tar
35M all.tar.zstd # -19
28M all.tar.xz
28M all.7z
19M git-repo.7z
9.5M all.tar.zstd # -19 --long
8.3M all.tar.lrz
7.3M all.tar.zstd # --ultra -22 --long=31 --format=lzma # but it's much slower than lrz when compressing
However, lrz clearly wins this time. And that's with default settings! Wow.
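The git-repo entries above can be reproduced along these lines: commit each build into one repository, oldest first, so that git's delta compression can exploit how similar consecutive builds are. Stand-in directories are used here instead of real builds:

```shell
# Sketch of the git-repo trick with two tiny fake "builds".
set -e
work=$(mktemp -d); cd "$work"
mkdir -p builds/commit1 builds/commit2
printf 'mostly identical payload, v1\n' > builds/commit1/file
printf 'mostly identical payload, v2\n' > builds/commit2/file

git init --quiet git-repo && cd git-repo
git config user.email builds@example.org   # hypothetical identity
git config user.name  builds

for sha in commit1 commit2; do             # oldest build first
    rm -rf build && cp -r "../builds/$sha" build
    git add -A && git commit --quiet -m "$sha"
done
git gc --quiet --aggressive                # repack so deltas are computed
# "tar cf git-repo.tar .git" (optionally followed by 7z) gives the sizes above.
```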
Conclusion
I think that there are ways to fiddle with options to get even better results. Suggestions are welcome!
However, at this point it looks like the best way is to use 7z to compress each build individually.
Messing around with one huge archive is probably not worth the savings: it would make decompression significantly slower, and we want decompression to be as fast as possible.