Comments (25)

ctalkington commented on May 23, 2024

Well, it used to be this way, for zips at least. Tar has to know the length beforehand. I had it collect since it simplifies the code for file queuing.

Really, I'm not sure how much of a difference it makes, because we need to feed ourselves in order. So if a stream is slow anyway, it will hold things up just as much as collecting it does. The collector just makes a nice wrapper around the process.

danmilon commented on May 23, 2024

It makes no difference if you want to get the zip in memory, but it should if you pipe it to another stream. If you collect, you send the destination stream one large block of data, but if you stream it, you send little pieces. With the latter approach, the underlying stream should cope better.

It should also make a difference in memory usage (i.e. you don't have to hold the data in memory; you can feed it downstream in pieces).

At least that's what makes sense to me. I'll give this a shot sometime.

ctalkington commented on May 23, 2024

True about memory. I want to write some benchmarks in the near future to test with various data sizes.

With readable-stream, we are basically moving the emitting part out of archiver's realm. Archiver simply pushes data into an internal buffer that readable-stream manages. There is a low watermark that asks for data when needed and a high watermark that tells it when to emit. The internal push method is what we use to fill the buffer when we have data. This is a little different from the way it was originally envisioned, I think, but since we are working with variable input sources, that's the way I've structured it to work.
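
A rough sketch of the push-based pattern described above, assuming a readable-stream Readable subclass (the names here are illustrative, not archiver's actual internals):

var Readable = require('readable-stream').Readable;
var util = require('util');

function ArchiveStream(options) {
  Readable.call(this, options); // readable-stream manages the internal buffer and watermarks
}
util.inherits(ArchiveStream, Readable);

// _read is called whenever the consumer wants more data
ArchiveStream.prototype._read = function (size) {
  // with variable input sources there is nothing to read on demand;
  // data is pushed as it becomes available instead
};

// push() fills the internal buffer; it returns false once the
// high watermark is reached, signalling the producer to pause
ArchiveStream.prototype._append = function (chunk) {
  return this.push(chunk);
};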

danmilon commented on May 23, 2024

I don't see how readable-stream's internal buffer is a solution to the first issue.

The way I think of it,

archive.pipe(net.createConnection(...))

will be faster if you send the net stream smaller chunks, sooner, rather than delivering one large blob later.

ctalkington commented on May 23, 2024

Faster only for zips; it would be interesting to see benchmarks. I've mainly worked with fs and haven't done much testing with net.

It actually reads 16 KB chunks by default; my bad, but it looks like it'll read the whole thing if pipe doesn't tell it how much it wants. It's just the new streams2 design.

Also, since archiver is based on readable-stream, advanced users can control the buffer, watermarks, pipe sizes, etc. I do need to further test some parts of the concept, though.

https://github.com/isaacs/readable-stream#new-readableoptions
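
For reference, those knobs are plain readable-stream constructor options; a minimal sketch, assuming you only want to change the buffering threshold (16 * 1024 mirrors the default mentioned above):

var Readable = require('readable-stream').Readable;

// highWaterMark controls how many bytes are buffered internally
// before the stream stops asking _read for more data
var stream = new Readable({ highWaterMark: 16 * 1024 });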

ctalkington commented on May 23, 2024

Also, in regards to speed and piping: it's really controlled by the user, because there can be huge gaps between finishing reading one stream and when the user adds the next or calls finalize. That's one of the more complex parts, because most streams rely on one source; we are usually waiting on multiple sources, and for the user to define them.

Next time I'm in the code, I'll see what I can do for zip and pushing data earlier; I just think it's going to add complexity that I was trying to avoid by taking emitting out of each archiver subclass.

ctalkington commented on May 23, 2024

Thanks for the feedback so far, much appreciated. I do think this new structure is for the best, and the fact that it lets us be v0.10 ready while supporting v0.6, v0.8, and v0.9 is a plus. I just got so tired of managing the emit queue.

ctalkington commented on May 23, 2024

OK, zip should now push at the first chance.

danmilon commented on May 23, 2024

Was this reverted recently in 0b6d0fe due to #16?
I'm not saying it shouldn't be fixed. I'm saying it should be fixed properly. Reverting this feature so it "just works" is no solution.

To illustrate the difference it makes:

$ git checkout 0b6d0f
$ node benchmark/simple/pack-zip.js ~/Videos/joyent\ api.mp4 
min: 71.66 mbits/s
avg: 115.61 mbits/s
max: 157.12 mbits/s
median: 133.36 mbits/s

$ git checkout HEAD^
$ node benchmark/simple/pack-zip.js ~/Videos/joyent\ api.mp4 
min: 199.04 mbits/s
avg: 230.04 mbits/s
max: 246.55 mbits/s
median: 230.81 mbits/s

Where joyent api.mp4 is a 200 MB file. Of course, memory usage rises to 200+ MB with the former, compared to a constant 50 MB with the latter.

Did you figure out what the issue with #16 was? Let's fix it properly.

ctalkington commented on May 23, 2024

Nope, it's most likely the compression level. The issue was that the way the zlib helper worked with callbacks could push data out of order; the old approach, where we listen on zlib's data events, should always stay in order.

ctalkington commented on May 23, 2024

@danmilon I've added the ability to specify the compression level via the 2nd arg.

Compression levels.
#define Z_NO_COMPRESSION 0
#define Z_BEST_SPEED 1
#define Z_BEST_COMPRESSION 9
#define Z_DEFAULT_COMPRESSION (-1)
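
As a sketch of what those constants map to when creating a deflate stream with node's zlib (the level option is node's zlib API; exactly how archiver exposes the 2nd arg may differ):

var zlib = require('zlib');

// level 1 (Z_BEST_SPEED) trades compression ratio for throughput;
// level 9 (Z_BEST_COMPRESSION) is the opposite extreme
var deflate = zlib.createDeflateRaw({ level: 1 });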

ctalkington commented on May 23, 2024

What's interesting is that I default to best speed (1), so maybe it's just that it handles the data differently from the helper functions (which we used before), although those should just be creating a new instance internally (see below), so I'm not sure why there's such a drop.

Convenience Methods
All of these take a string or buffer as the first argument, and call the supplied callback with callback(error, result). The compression/decompression engine is created using the default settings in all convenience methods.

I was happy with the helper methods, even though they give a little less control; it's just that depending on the data received from the stream, the callbacks would get fired out of order.

It could also be that we are writing data to zlib, which has its own buffer (chunkSize, default: 16*1024) before it triggers a data emit. That extra step could very well be slowing things down, as I'm not sure whether node tweaks the buffering when using callbacks.

danmilon commented on May 23, 2024

Putting the fact that it's slower aside, my point is that it drains the whole stream into memory before processing it. The change is in 0b6d0fe, as shown above. This should be reopened IMO.

ctalkington commented on May 23, 2024

@danmilon if you have a solution that works around out-of-order callbacks, I'd be more than open to investigating the newer way.

danmilon commented on May 23, 2024

Trying to find the cause of the regression. I'll keep you posted.

ctalkington commented on May 23, 2024

I think it had to do with some data taking longer to compress, so its callback wouldn't get fired in order.

EDIT: I believe the only way around it would be to keep a buffer of chunks from the stream and have the zlib callback process the next one in order, versus relying on data events.
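
A minimal sketch of that ordering idea (all names are illustrative): queue incoming chunks and only hand the next one to the zlib helper once the previous callback has returned, so output order matches input order:

var zlib = require('zlib');

var queue = [];     // chunks waiting to be compressed, in arrival order
var busy = false;

function enqueue(chunk) {
  queue.push(chunk);
  next();
}

function next() {
  if (busy || queue.length === 0) return;
  busy = true;
  zlib.deflateRaw(queue.shift(), function (err, compressed) {
    if (err) throw err;
    // push `compressed` downstream here; order is preserved because
    // only one chunk is ever in flight at a time
    busy = false;
    next();
  });
}

One trade-off of this approach is that only one chunk is compressed at a time.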

ctalkington commented on May 23, 2024

@danmilon would you try out ea88dac and let me know if it makes much of a difference?

Actually, make that 9f0ba92.

danmilon commented on May 23, 2024

Same. Here's an idea: why don't we pipe the source to deflate (automatic back-pressure) and then deflate.on('data', ...)? It seems like we're reinventing .pipe() here.
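
Roughly what that suggestion looks like; a sketch only, with fs.createReadStream standing in for an entry's source stream:

var fs = require('fs');
var zlib = require('zlib');

var source = fs.createReadStream('file.txt'); // stand-in for an entry's input
var deflate = zlib.createDeflateRaw({ level: 1 });

// pipe() handles back-pressure between the source and zlib automatically
source.pipe(deflate);

deflate.on('data', function (chunk) {
  // the compressed chunk would be appended to the archive output here
});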

ctalkington commented on May 23, 2024

Hmm, are you sure the performance hit is actually coming from the deflate change? Because what I did in that commit definitely should have resolved the perf issues. The pipe may work, hmm.

EDIT: OK, so pipe does work from an initial test, but we need the data from the stream to compute the checksum.

ctalkington commented on May 23, 2024

I definitely saw a speed improvement with pipe on a 100 MB file, but as soon as I listened for the raw data to do the checksum, it dropped by half:

min: 0.00 mbits/s
avg: 5.76 mbits/s
max: 10.74 mbits/s
median: 9.93 mbits/s

vs

min: 0.00 mbits/s
avg: 3.58 mbits/s
max: 5.53 mbits/s
median: 4.54 mbits/s

EDIT: OK, the performance dip is actually CRC32.
EDIT2: the test file I'm using is pure text, so the numbers are way lower due to the sheer amount of compressing going on.

ctalkington commented on May 23, 2024

I'm wondering if, to do this in a performant way, we are going to have to make a zlib-based class that can build the checksum while data is piped into it.

EDIT: this idea works, but the checksum kills the speed.
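
One way to sketch that idea (illustrative only, not archiver's actual class): a pass-through Transform that updates a CRC-32 over the raw bytes and then forwards them, so it can sit in front of deflate in a pipe chain:

var Transform = require('readable-stream').Transform;
var util = require('util');

// standard reflected CRC-32 (zlib polynomial), table-driven
var CRC_TABLE = (function () {
  var table = new Array(256);
  for (var n = 0; n < 256; n++) {
    var c = n;
    for (var k = 0; k < 8; k++) {
      c = (c & 1) ? (0xEDB88320 ^ (c >>> 1)) : (c >>> 1);
    }
    table[n] = c >>> 0;
  }
  return table;
})();

// update a running CRC-32 (starts at 0) with one chunk
function crc32Update(crc, chunk) {
  var c = (crc ^ 0xFFFFFFFF) >>> 0;
  for (var i = 0; i < chunk.length; i++) {
    c = (CRC_TABLE[(c ^ chunk[i]) & 0xFF] ^ (c >>> 8)) >>> 0;
  }
  return (c ^ 0xFFFFFFFF) >>> 0;
}

function ChecksumStream(options) {
  Transform.call(this, options);
  this.crc = 0;
}
util.inherits(ChecksumStream, Transform);

ChecksumStream.prototype._transform = function (chunk, encoding, callback) {
  this.crc = crc32Update(this.crc, chunk); // build the checksum as data flows through
  callback(null, chunk);                   // forward the raw chunk unchanged
};

// usage sketch: source.pipe(checksum).pipe(deflate) keeps back-pressure intact
// while the CRC is computed on the uncompressed bytes

The per-byte JavaScript loop here is also consistent with the slowdown described above.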

ctalkington commented on May 23, 2024

Can you give eab88eb a go and let me know the results you get? If you would, also try something other than video, as that doesn't compress much, so it skews things a bit.

danmilon commented on May 23, 2024

See comments in f0bf860

ctalkington commented on May 23, 2024

@danmilon can you give master a go and let me know if all our changes have helped at all?

danmilon commented on May 23, 2024

All good. Memory steady at 10 MB, magnificent!
