Comments (16)
On Friday, 12 August 2016 07:51:40 CEST, Aleks-Daniel Jakimenko-Aleksejev
Now, there are some funny entries here. Obviously, .tar is the size of all repos together, uncompressed. git-repo is a git repo with every build committed one after another in the right order. Look, it performs better than some of the other things! Amazing! And even better if you compress it afterwards. Who would've thought, right?

However, lrz clearly wins this time. And that's with default settings! Wow.

Conclusion

I think that there are ways to fiddle with options to get even better results. Suggestions are welcome!
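For reference, the git-repo entry can be reproduced with something along these lines. This is only a sketch: the builds/ directory layout and the commit message format are made up here, and two tiny fake builds are generated so the script is self-contained.

```shell
#!/bin/sh
# Sketch of the "git-repo" approach: commit each uncompressed build into a
# git repository, oldest first, so `git bisect` can later walk the history.
# The builds/ layout is hypothetical; two fake builds keep this runnable.
set -e
work=$(mktemp -d)
cd "$work"

mkdir -p builds/2016.01 builds/2016.02
echo "binary v1" > builds/2016.01/perl6
echo "binary v2" > builds/2016.02/perl6

mkdir git-repo
cd git-repo
git init -q
git config user.name  demo
git config user.email demo@example.com

for build in ../builds/*; do           # glob order is oldest-first here
    rm -rf files
    cp -r "$build" files               # replace the tree with this build
    git add -A
    git commit -q -m "build $(basename "$build")"
done

git rev-list --count HEAD              # one commit per build
```

Because unchanged files hash to the same blob, each piece of content is stored only once across all commits.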
Actually I like the git repo idea very much. With the build files in a git repo instead of the source files, one can use plain old git bisect to find the offending commit. I'm not surprised that git performs so well on this task, as it stores each piece of content only once. So unless a file got changed, it will not be stored again for the new version.

Have you run git repack after committing the different versions? That should reduce the repository size considerably, since it uses delta compression, which should help especially with the larger build files that change a lot, like CORE.setting.moarvm.

Stefan
from whateverable.
Actually I like the git repo idea very much.
It is hard to tell whether it is going to perform better when we put all builds into it. Currently, with just 7 builds in, the 28 MB repo size is equivalent to storing each build separately (≈4 MB per build).
Also, I'm not sure if the performance is going to be adequate. Bisect has to jump a couple hundred commits back and forth, which is definitely slower than just unpacking a 4 MB archive (or am I wrong?).
Have you run git repack after committing the different versions?
Well, yes, it says there's nothing to repack (perhaps git gc called it automatically?).
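For reference, git gc does run a repack automatically once enough loose objects accumulate, which would explain a manual repack finding nothing to do. A forced repack and a before/after look at the object store can be sketched like this (on a throwaway demo repo):

```shell
#!/bin/sh
# Sketch: inspect and repack a repository by hand. `git gc` runs repack
# automatically past certain thresholds, which is why a later manual
# `git repack` may report nothing left to pack.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name  demo
git config user.email demo@example.com
echo "some content" > file
git add file
git commit -q -m "initial"

git count-objects -v        # loose objects before packing
git repack -a -d -q         # pack everything, drop redundant loose objects
git count-objects -v        # objects now live in a single pack
```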
How about testing LZ4 or LZHAM? I suspect they won't compress as well, but they are supposed to be very fast at decompressing, so the trade-off might be worth it.
@MasterDuke17 I've added lz4 (and a bunch of other stuff) to the main post.
LZ4 is actually a very good finding, thank you very much. Indeed, we should probably forget about space savings and think about decompression speed instead.
How long does it take to decompress one build compressed with 7z? 0.4s. Why? See this. Basically, LZMA is not a good fit for multithreading (besides just being slow).

7z does not look like a good candidate anymore. A typical bisect takes a bit less than 15 steps, so that's roughly a 5-second delay just to decompress the data. We can speed it up a bit by keeping at least the starting points decompressed, but in the end it is still a significant delay.
So let's see how fast things decompress:
0m45.291s c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7-extremely-slow.tar.lrz
0m0.716s c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.bz2
0m0.548s c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.lzfse
0m0.421s c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.lrz
0m0.415s c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.xz
0m0.414s c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.7z
0m0.325s c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tpxz
0m0.262s c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.zip
0m0.225s c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.gz
0m0.208s c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.lzham
0m0.169s c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.brotli
0m0.130s c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7-L22.tar.zst
0m0.120s c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7-L15.tar.zst
0m0.087s c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.lz4
0m0.049s c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar
Almost everything is with default options, so feel free to recommend something specific.
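The timings above can be reproduced with a pattern like the following. This sketch uses only tar+gzip so it stays self-contained; the other tools follow the same shape (compress once, then time the extraction).

```shell
#!/bin/sh
# Sketch: measure wall-clock decompression time of an archive, as in the
# list above. Only tar+gzip are used here so it runs anywhere; substitute
# the other compressors to fill in their rows.
set -e
work=$(mktemp -d)
cd "$work"

mkdir build
dd if=/dev/zero of=build/blob bs=1024 count=512 2>/dev/null  # dummy payload
tar czf build.tar.gz build
rm -rf build

# Wrap the extraction in `time` interactively to get numbers like above:
#   time tar xzf build.tar.gz
tar xzf build.tar.gz
test -f build/blob           # the archive really round-tripped
```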
As stupid as it sounds, brotli is a clear winner right now (UPDATE: nope. See next comment). It is a bit slow during compression, but I don't mind it at all.
We have a new winner: https://github.com/Cyan4973/zstd
≈0m0.130s decompression, ≈4.9M size, compression faster than brotli. Basically, it is a winner on all criteria except for file size, and it is only ≈0.4 MB worse. Where is the catch??
In fact, zstd is rather young, and if I understand correctly it is not multithreaded yet. So perhaps it will get even better?
We can tweak it a bit by using a different compression level. The numbers above are with the max level (22), but we can make it ≈10 ms faster by sacrificing ≈0.8 MB (level 15). I don't care about either of these.
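The level trade-off can be sketched with commands like these. One caveat: in current zstd releases, levels above 19 additionally require the --ultra flag; the comment above was written against an early zstd, so the exact flags may have differed then.

```shell
#!/bin/sh
# Sketch: trade compression ratio against decompression speed via zstd
# levels. A dummy tarball stands in for a real build archive.
# Note: levels above 19 need --ultra in current zstd versions.
set -e
work=$(mktemp -d)
cd "$work"
dd if=/dev/zero of=build.tar bs=1024 count=256 2>/dev/null   # dummy archive

zstd -q -15         build.tar -o build-L15.tar.zst   # faster to decompress
zstd -q -22 --ultra build.tar -o build-L22.tar.zst   # smaller output

zstd -q -d build-L15.tar.zst -o roundtrip.tar
cmp build.tar roundtrip.tar    # lossless round trip
```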
@AlexDaniel: were you using "plain" xz or maybe pixz in these benchmarks? pixz claims to be much faster on both compression and decompression, thanks to its parallelization.
@xenu: pixz is .tpxz in these tests. It does perform faster, but not by much.
By the way, I found this blog post very interesting: http://fastcompression.blogspot.com/2013/12/finite-state-entropy-new-breed-of.html
It is by the author of lz4 and zstd. Amazing stuff.
OK, so this was implemented some time ago along with other major changes. Given that everything is written in 6lang now, some things tend to segfault sometimes… but otherwise everything is fine. At least, compression is definitely there, so I am closing this.
Another article about Zstandard.
https://code.facebook.com/posts/1658392934479273/smaller-and-faster-data-compression-with-zstandard/
Some news! I never liked the dependency on lrzip because it is only required when working with old builds. Now that zstd has a long-range mode, I think it's better to rely on zstd for everything.
I have updated the post, but basically this is my finding:
9.5M all.tar.zstd # -19 --long
8.3M all.tar.lrz
7.3M all.tar.zstd # -19 --long --format=lzma # but it's much slower than lrz when compressing
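The three variants above correspond roughly to commands like the following (a sketch with a dummy all.tar standing in for the concatenated builds). Note that zstd's --format=lzma only works if zstd was built with liblzma support, and lrzip must be installed separately, so those two lines are shown as comments.

```shell
#!/bin/sh
# Sketch: the archive variants measured above, on a dummy all.tar.
# --long enables long-range matching, so identical chunks shared between
# builds get deduplicated even when far apart in the tar stream.
set -e
work=$(mktemp -d)
cd "$work"
dd if=/dev/zero of=all.tar bs=1024 count=256 2>/dev/null

zstd -q -19 --long all.tar -o all.tar.zstd          # the 9.5M variant
# lrzip all.tar                                     # the 8.3M variant (needs lrzip)
# zstd -19 --long --format=lzma all.tar             # the 7.3M variant (needs liblzma)

zstd -q -d --long all.tar.zstd -o check.tar
cmp all.tar check.tar    # round trip sanity check
```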
So lrzip is still somewhat winning when it comes to compression ratio. Moreover, it is blazing fast during compression: zstd takes ≈70 seconds to finish the job, while lrzip gets it done in ≈20 s.
However, what I actually care about is decompression speed (because we need fast access to slowly accumulating builds):
9.5M all.tar.zstd # 0.338s
8.3M all.tar.lrz # 1.332s
7.3M all.tar.zstd # 0.946s
So lrzip stands out as being noticeably slow. It's clear that after the transition we will win on both compression ratio and decompression speed.

This test is done using 7 builds in a single archive. That number was picked semi-randomly for benchmarking only; the actual number of builds per archive (as currently used with lrzip) is 20. That is, putting more builds into the archive tends to make decompression slower but obviously improves the overall compression ratio.
I guess the next step is to increase the number of builds in zstd archives until I reach the same decompression speed, then compare the ratio.
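That next step could be scripted as a sweep over archive sizes, along these lines. Everything here is a sketch: the "builds" are dummy directories, and the real run would use actual build trees and wrap the decompression step in `time` to compare speeds.

```shell
#!/bin/sh
# Sketch: for increasing numbers of builds per archive, compress with zstd
# long-range mode and report the resulting size. Wrap the decompression at
# the end in `time` interactively to compare speeds against lrzip.
set -e
work=$(mktemp -d)
cd "$work"

for i in 1 2 3 4 5 6 7 8; do                 # dummy "builds"
    mkdir -p "builds/b$i"
    dd if=/dev/zero of="builds/b$i/blob" bs=1024 count=64 2>/dev/null
done

for n in 2 4 8; do
    tar cf "chunk-$n.tar" -C builds $(ls builds | head -n "$n")
    zstd -q -19 --long "chunk-$n.tar" -o "chunk-$n.tar.zst"
    printf '%s builds: ' "$n"
    du -h "chunk-$n.tar.zst" | cut -f1       # archive size per chunk count
    zstd -q -d -f --long "chunk-$n.tar.zst" -o /dev/null
done
```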
<MasterDuke> AlexDaniel`: i wonder if https://github.com/mhx/dwarfs would be good for the *ables. might make managing the builds simpler
(log)
Fantastic find! We should definitely bench it one day.
I think using zstd for everything is a good idea. It'd make some code paths more generic, and it'll drop the dependency on lrzip. dwarfs is fine locally, but whateverable is also serving files for Blin and other remote usages, so zstd is still needed.
Closing this in favor of #389.
I think using zstd for everything is a good idea. It'd make some code paths more generic, and it'll drop the dependency on lrzip. dwarfs is fine locally, but whateverable is also serving files for Blin and other remote usages, so zstd is still needed.
If the whole architecture was being redone from scratch, using dwarfs for local storage and compressing on the fly with zstd when serving files would be an interesting experiment.
Yes and no 🤷. Depending on how much you're using the mothership remotely, it might be that you'll have lots of archives locally. If they're compressed in long-range mode, your local setup will be as efficient (storage-wise) as the remote one. Compressing on the fly in long-range mode doesn't work (it takes roughly a minute to do that for 20 builds). Of course, nothing stops you from using dwarfs locally, but the current system with archives seems easier and simpler.
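One practical note for remote consumers of these archives: when an archive is compressed with a large --long window (above zstd's default decoder limit of window log 27), decompression must be told to allow the matching window as well, otherwise zstd refuses in order to cap decoder memory. A round-trip sketch:

```shell
#!/bin/sh
# Sketch: round-tripping a long-range archive. With window logs above
# zstd's default decoder limit (27), the decompressor must also be given
# --long with a matching (or larger) window log.
set -e
work=$(mktemp -d)
cd "$work"
dd if=/dev/zero of=all.tar bs=1024 count=128 2>/dev/null

zstd -q --long=30 all.tar -o all.tar.zst
zstd -q -d --long=30 all.tar.zst -o out.tar   # matching window on decode
cmp all.tar out.tar
```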