Comments (16)

niner commented on July 23, 2024

On Friday, 12 August 2016 07:51:40 CEST Aleks-Daniel Jakimenko-Aleksejev

Now, there are some funny entries here. Obviously, .tar is the size of
all repos together, uncompressed. git-repo is a git repo with every
build committed one after another in the right order. Look, it performs
better than some of the other things! Amazing! And even better if you
compress it afterwards. Who would've thought, right?

However, lrz clearly wins this time. And that's with default settings!
Wow.

Conclusion

I think that there are ways to fiddle with options to get even better
results. Suggestions are welcome!

Actually I like the git repo idea very much. With the build files in a git repo
instead of the source files, one can use plain old git bisect to find the
offending commit. I'm not surprised that git performs so well on this task, as
it stores each piece of content only once. So unless a file got changed, it
will not be stored again for the new version.

Have you run git repack after committing the different versions? That should
reduce the repository size considerably, since it uses delta compression, which
should help especially with the larger build files that change a lot, like
CORE.setting.moarvm.

Stefan
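
Stefan's repack suggestion can be sketched as a tiny end-to-end demo. The build directories and file names below are stand-ins, not real rakudo builds:

```shell
set -e
cd "$(mktemp -d)"
mkdir build-a build-b
seq 1 20000 > build-a/CORE.setting.moarvm                      # stand-in build file
sed 's/^42$/changed/' build-a/CORE.setting.moarvm > build-b/CORE.setting.moarvm
git init -q builds-repo && cd builds-repo
git config user.name demo && git config user.email demo@example.com
cp ../build-a/CORE.setting.moarvm . && git add -A && git commit -qm 'build a'
cp ../build-b/CORE.setting.moarvm . && git add -A && git commit -qm 'build b'
# force a full repack; delta compression stores the second version
# as a small diff against the first
git repack -a -d -f --depth=250 --window=250
du -sh .git/objects/pack
```

The `--depth`/`--window` values are just the usual "aggressive" settings; defaults already delta-compress, repacking mainly consolidates everything into one pack.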

from whateverable.

AlexDaniel commented on July 23, 2024

Actually I like the git repo idea very much.

It is hard to tell if it is going to perform better when we put all builds into it. Currently, with just 7 builds in, 28 MB repo size is equivalent to storing each build separately (β‰ˆ4 MB per build).

Also, I'm not sure if performance is going to be adequate. Bisect has to jump back and forth across a couple hundred commits, which is definitely slower than just unpacking a 4 MB archive (or am I wrong?).
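
For reference, the bisect workflow over such a repo looks roughly like this; a toy repo with a fake regression at "build 13" stands in for the real committed builds:

```shell
set -e
cd "$(mktemp -d)"
git init -q
git config user.name demo && git config user.email demo@example.com
for i in $(seq 1 20); do
    # each commit stands in for one committed build; the "regression"
    # lands at build 13
    if [ "$i" -lt 13 ]; then echo "good $i" > status; else echo "bad $i" > status; fi
    git add status && git commit -qm "build $i"
done
first=$(git rev-list HEAD | tail -n 1)        # oldest commit, known good
git bisect start HEAD "$first"
# the command exits 0 for good builds; after ~log2(20) checkouts
# bisect names "build 13" as the first bad commit
git bisect run sh -c 'grep -q good status'
```

With real builds the `sh -c` command would run the checked-out build against the failing code, and each step pays the cost of a full checkout rather than of unpacking one archive.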

Have you run git repack after committing the different versions?

Well, yes, it says there's nothing to repack (perhaps git gc called it automatically?).

MasterDuke17 commented on July 23, 2024

How about testing LZ4 or LZHAM? I suspect they won't compress as well, but they are supposed to be very fast at decompressing, so the trade-off might be worth it.

AlexDaniel commented on July 23, 2024

@MasterDuke17 I've added lz4 (and a bunch of other stuff) to the main post.

LZ4 is actually a very good finding, thank you very much. Indeed, we should probably forget about space savings and think about decompression speed instead.

How long does it take to decompress one build compressed with 7z? 0.4s. Why? See this. Basically, LZMA is not a good fit for multithreading (besides just being slow).
7z does not look like a good candidate anymore. Typical bisect takes a bit less than 15 steps, so that's a 5 second delay just to decompress the data. We can speed it up a bit by keeping at least the starting points decompressed, but in the end it is still a significant delay.

So let's see how fast things decompress:

0m45.291s   c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7-extremely-slow.tar.lrz
0m0.716s    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.bz2
0m0.548s    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.lzfse
0m0.421s    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.lrz
0m0.415s    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.xz
0m0.414s    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.7z
0m0.325s    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tpxz
0m0.262s    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.zip
0m0.225s    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.gz
0m0.208s    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.lzham
0m0.169s    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.brotli
0m0.130s    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7-L22.tar.zst
0m0.120s    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7-L15.tar.zst
0m0.087s    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.lz4
0m0.049s    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar

Almost everything is with default options, so feel free to recommend something specific.
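
For reproducibility, a rough sketch of such a timing harness (a generated .tar.gz stands in for the real archives here; tar autodetects gz/bz2/xz/zst/lz4 by suffix, while lrz, brotli, lzham, 7z and tpxz need their own tools):

```shell
set -e
cd "$(mktemp -d)"
seq 1 10000 > payload
tar -czf sample.tar.gz payload && rm payload    # stand-in archive
for archive in *.tar*; do
    rm -rf out && mkdir out
    printf '%s\n' "$archive"
    # time a full extraction into a clean directory
    { time tar -xf "$archive" -C out; } 2>&1
done
```

For stable numbers the runs should be repeated a few times on a warm cache; single runs like the ones in the table are dominated by noise at the low end.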

As stupid as it sounds, brotli is a clear winner right now (UPDATE: nope. See next comment). It is a bit slow during compression, but I don't mind it at all.

AlexDaniel commented on July 23, 2024

We have a new winner: https://github.com/Cyan4973/zstd

β‰ˆ0m0.130s decompression, β‰ˆ4.9M size, compression faster than brotli. Basically, it is a winner on all criteria except for file size, and it is only β‰ˆ0.4MB worse. Where is the catch??
In fact, zstd is rather young, and if I understand correctly it is not multithreaded yet. So perhaps it will get even better?

We can tweak it a bit by using a different compression level. The numbers above are with the max level (22), but we can make it β‰ˆ10 ms faster by sacrificing β‰ˆ0.8 MB (level 15). I don't mind either of these.
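
The level comparison can be redone along these lines (a sketch assuming the zstd CLI is installed; generated data stands in for a real build tarball, and note that levels above 19 require --ultra):

```shell
command -v zstd >/dev/null 2>&1 || { echo 'zstd not installed, skipping'; exit 0; }
set -e
cd "$(mktemp -d)"
seq 1 200000 > build.tar                  # stand-in for one build tarball
for level in 15 22; do
    # --ultra unlocks levels 20-22; it is harmless at lower levels
    zstd -q -k -f --ultra "-$level" build.tar -o "build-L$level.tar.zst"
    printf 'L%s:\t%s bytes\n' "$level" "$(wc -c < "build-L$level.tar.zst")"
    # -t decompresses in memory, which isolates pure decompression time
    { time zstd -q -t "build-L$level.tar.zst"; } 2>&1
done
```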

xenu commented on July 23, 2024

@AlexDaniel: were you using "plain" xz or maybe pixz in these benchmarks? pixz claims to be much faster on both compression and decompression, thanks to its parallelization.

AlexDaniel commented on July 23, 2024

@xenu pixz is .tpxz in these tests. It is indeed faster, but not by much.

AlexDaniel commented on July 23, 2024

By the way, I found this blog post very interesting: http://fastcompression.blogspot.com/2013/12/finite-state-entropy-new-breed-of.html
It is by the author of lz4 and zstd. Amazing stuff.

AlexDaniel commented on July 23, 2024

OK, so this was implemented some time ago along with other major changes. Given that everything is written in 6lang now, some things tend to segfault sometimes… but otherwise everything is fine. At least, compression is definitely there, so I am closing this.

MasterDuke17 commented on July 23, 2024

Another article about Zstandard.

https://code.facebook.com/posts/1658392934479273/smaller-and-faster-data-compression-with-zstandard/

AlexDaniel commented on July 23, 2024

Some news! I never liked the dependency on lrzip, because it is only needed when working with old builds. Now that zstd has a long-range mode, I think it's better to rely on zstd for everything.

I have updated the post, but basically this is my finding:

9.5M    all.tar.zstd # -19 --long
8.3M    all.tar.lrz
7.3M    all.tar.zstd # -19 --long --format=lzma # but it's much slower than lrz when compressing
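
The first entry in that listing can presumably be reproduced along these lines (a sketch assuming the zstd CLI; seven near-identical generated files stand in for seven builds, and the --format=lzma variant works the same way when zstd is built with liblzma):

```shell
command -v zstd >/dev/null 2>&1 || { echo 'zstd not installed, skipping'; exit 0; }
set -e
cd "$(mktemp -d)"
mkdir builds
seq 1 100000 > base
for i in 1 2 3 4 5 6 7; do
    # seven near-identical "builds": same data plus a tiny tweak each
    cp base "builds/build-$i" && echo "tweak $i" >> "builds/build-$i"
done
tar -cf all.tar builds
zstd -q -19 --long -f all.tar -o all.tar.zstd     # long-range matching on
zstd -q -d --long -f all.tar.zstd -o all.out      # pass --long on decompression too
cmp all.tar all.out                               # round-trip sanity check
wc -c all.tar all.tar.zstd
```

Long-range mode is what lets zstd find matches between builds that sit megabytes apart in the tarball, which is exactly the redundancy lrzip exploits.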

So lrzip is still somewhat winning when it comes to compression ratio. Moreover, it is blazing fast during compression. zstd takes β‰ˆ70 seconds to finish the job, while lrzip gets it done in β‰ˆ20s.

However, what I actually care about is decompression speed (because we need fast access to slowly accumulating builds):

9.5M    all.tar.zstd  # 0.338s
8.3M    all.tar.lrz   # 1.332s
7.3M    all.tar.zstd  # 0.946s

So lrzip stands out as being noticeably slow. It's clear that after the transition we will gain on both compression ratio and decompression speed.

This test was done with 7 builds in a single archive; that number was picked semi-randomly for benchmarking only, based on earlier lrzip behavior, while the actual number of builds per archive is 20. Note that putting more builds into an archive tends to make decompression slower but obviously improves the overall compression ratio.

I guess the next step is to increase the number of builds in zstd archives until I reach the same decompression speed, then compare the ratio.

AlexDaniel commented on July 23, 2024

<MasterDuke> AlexDaniel`: i wonder if https://github.com/mhx/dwarfs would be good for the *ables. might make managing the builds simpler

(log)

Fantastic find! We should definitely bench it one day.

AlexDaniel commented on July 23, 2024

I think using zstd for everything is a good idea. It'd make some code paths more generic, and it'll drop the dependency on lrzip. dwarfs is fine locally, but whateverable is also serving files for Blin and other remote usages, so zstd is still needed.

AlexDaniel commented on July 23, 2024

Closing this in favor of #389.

MasterDuke17 commented on July 23, 2024

I think using zstd for everything is a good idea. It'd make some code paths more generic, and it'll drop the dependency on lrzip. dwarfs is fine locally, but whateverable is also serving files for Blin and other remote usages, so zstd is still needed.

If the whole architecture were being redone from scratch, using dwarfs for local storage and compressing on the fly with zstd when serving files would be an interesting experiment.

AlexDaniel commented on July 23, 2024

Yes and no 🀷🏼 Depending on how much you're using the mothership remotely, it might be that you'll have lots of archives locally. If they're compressed in long-range mode, your local setup will be as efficient (storage-wise) as the remote one. Compressing on the fly in long-range mode doesn't work (it takes roughly a minute to do that for 20 builds). Of course, nothing stops you from using dwarfs locally, but the current system with archives seems simpler.
