Comments (5)

ncw commented on June 7, 2024

> Finally had time to read this and digest the issue being described (sorry!) 😅

Sorry it was a bit of a brain dump issue!

> You're right, this is a little annoying; I've encountered similar issues with the Extract method before: it's slow on large archives. If you want to extract just one file from a big .tar archive, the file might be at the very end; or if you want to list a directory, yes, you have to scan the entire archive. Memoization can help, as you mentioned. I haven't implemented that yet since I haven't decided on a good way to manage the memory implicitly in this case. But I'm open to it! (That way the caller wouldn't necessarily have to do it.)

I can turn on caching for this in rclone and it will cache the archive to disk, which would make the performance acceptable at the cost of disk usage. I could make this automatic for everything except .zips, and that would probably satisfy most rclone users.

> I'm not sure seeking tar files is really useful unless you know the offset of every file in the archive; you'd have to scan the whole archive once first to build that index, since I don't think tar keeps one like zip does.

You are right. It would require scanning the whole file to find out which files are within it.

I was thinking this might be useful if you had a big tar with only a few files (imagine a 1GB tar with 10 x 100MB files) and just wanted to extract one of them, but with a tar with lots of files, seeking to each tar header in the stream will be very slow! (A seek in an HTTP stream means re-opening it with a new Range request and typically takes about 1s).
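
In Go terms, that kind of seek boils down to re-opening the body at an offset, something like this sketch (openAt is a made-up helper, and it assumes the server honours byte ranges):

```go
import (
	"context"
	"fmt"
	"io"
	"net/http"
)

// openAt re-opens url at the given byte offset using an HTTP Range request.
// Each call costs a full round trip, which is why seeking around a remote
// tar header-by-header is so slow.
func openAt(ctx context.Context, url string, offset int64) (io.ReadCloser, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-", offset))
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	if resp.StatusCode != http.StatusPartialContent {
		resp.Body.Close()
		return nil, fmt.Errorf("server ignored Range request: %s", resp.Status)
	}
	return resp.Body, nil
}
```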

Most tars are compressed though, and seeking in a compressed tar is very hard. If the tar was compressed with gzip with sync points then it is easier, but that isn't standard.

> It would be worth some experimenting to see whether it's possible to get the byte offsets reliably, especially using just the standard library reader. If it doesn't do any/much buffering, then one possibility is to create a new method -- not sure what to call it, maybe Scan() or Index() -- that does a quick pass through the file to record the offsets as it goes. Then future lookups are O(1). (This probably wouldn't work for compressed archives, as you mentioned.)

I think this is possible, but not with archive/tar. I looked at the code and there is a lot of state in the reader that you'd have to recreate to go back and read a file.

> Ok, so now that I'm on the same page, I don't have a great solution to this either :( I've kind of come to accept the fact that tar archives can be slow to read in some cases -- and I should probably update the docs to mention that until we find a way to speed things up. But I'm very interested in speeding things up if possible.

I guess one possibility is to integrate this: https://pkg.go.dev/github.com/google/crfs/stargz which reads and writes tars with individually compressed files and an index - that would seek perfectly! Yes, that is a bit of bradfitz magic :-) It's worth reading the explainer on the parent package.
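
For a feel of the API, reading one file out of a stargz with that package is roughly (a sketch; the file and entry names are made up):

```go
package main

import (
	"io"
	"log"
	"os"

	"github.com/google/crfs/stargz"
)

func main() {
	f, err := os.Open("archive.stargz") // made-up file name
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	fi, err := f.Stat()
	if err != nil {
		log.Fatal(err)
	}

	// Open reads the table of contents stored at the end of the archive.
	sgz, err := stargz.Open(io.NewSectionReader(f, 0, fi.Size()))
	if err != nil {
		log.Fatal(err)
	}

	// OpenFile returns an *io.SectionReader over just that entry's
	// (individually compressed) contents, so it is fully seekable.
	sr, err := sgz.OpenFile("dir/file.txt") // made-up entry name
	if err != nil {
		log.Fatal(err)
	}
	if _, err := io.Copy(os.Stdout, sr); err != nil {
		log.Fatal(err)
	}
}
```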

That could be another compression format for archiver to support - I'm not sure how you'd deal with the overlap with the existing tar format, or whether you can easily tell a stargz file from a normal .tar.gz.

> I know tar files are the darling archive format of the Unix world, but frankly they're only good at one thing: streaming tape archives. They're not good if you need random access or structured file listings. :( In many ways, I do consider zip to be a superior format in that sense...

Aye :-)

mholt commented on June 7, 2024

Finally had time to read this and digest the issue being described (sorry!) 😅

You're right, this is a little annoying; I've encountered similar issues with the Extract method before: it's slow on large archives. If you want to extract just one file from a big .tar archive, the file might be at the very end; or if you want to list a directory, yes, you have to scan the entire archive. Memoization can help, as you mentioned. I haven't implemented that yet since I haven't decided on a good way to manage the memory implicitly in this case. But I'm open to it! (That way the caller wouldn't necessarily have to do it.)

I'm not sure seeking tar files is really useful unless you know the offset of every file in the archive; you'd have to scan the whole archive once first to build that index, since I don't think tar keeps one like zip does. It would be worth some experimenting to see whether it's possible to get the byte offsets reliably, especially using just the standard library reader. If it doesn't do any/much buffering, then one possibility is to create a new method -- not sure what to call it, maybe Scan() or Index() -- that does a quick pass through the file to record the offsets as it goes. Then future lookups are O(1). (This probably wouldn't work for compressed archives, as you mentioned.)
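
As a rough sketch of what that Scan()/Index() could look like (assuming archive/tar keeps reading straight from the wrapped reader with no lookahead buffering; countingReader and indexTar are made-up names):

```go
import (
	"archive/tar"
	"io"
)

// countingReader tracks how many bytes have been read from the wrapped reader.
// Because it only implements Read, archive/tar cannot Seek it and must pull
// every byte through us, which keeps the count accurate.
type countingReader struct {
	r io.Reader
	n int64
}

func (cr *countingReader) Read(p []byte) (int, error) {
	n, err := cr.r.Read(p)
	cr.n += int64(n)
	return n, err
}

// indexTar does one quick pass over an uncompressed tar stream and records
// the byte offset of each entry's data, so a later lookup can seek straight
// to index[name] (with a fresh reader) instead of rescanning the archive.
func indexTar(r io.Reader) (map[string]int64, error) {
	cr := &countingReader{r: r}
	tr := tar.NewReader(cr)
	index := make(map[string]int64)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return index, nil // clean end of archive
		}
		if err != nil {
			return nil, err
		}
		// Next() consumes everything up to the start of this entry's data,
		// so cr.n is now the data offset; the entry is hdr.Size bytes long.
		index[hdr.Name] = cr.n
	}
}
```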

Ok, so now that I'm on the same page, I don't have a great solution to this either :( I've kind of come to accept the fact that tar archives can be slow to read in some cases -- and I should probably update the docs to mention that until we find a way to speed things up. But I'm very interested in speeding things up if possible.

I know tar files are the darling archive format of the Unix world, but frankly they're only good at one thing: streaming tape archives. They're not good if you need random access or structured file listings. :( In many ways, I do consider zip to be a superior format in that sense...

mholt commented on June 7, 2024

Thanks for the discussion!

Stargz looks like an improvement, although it is definitely non-standard... I'd be surprised if it were encountered in any significant number of places in the wild. I might be open to a proposal for how that would look/work with this library, but I might only be convinced to accept it if it is popular enough.

Another option may be to simply not support the advanced/seeking feature for .tar (or at least compressed .tar) -- basically just support the functionality for file formats that are properly designed for such a thing. I know that kind of sucks, but so does the .tar format for seeking. 🙃

ncw commented on June 7, 2024

From an rclone point of view, only the .zip archiver works really well for reading. The in-development archiver backend will write all the formats archiver supports, but downloading is a problem for everything except .zip, as rclone wants to read a file listing before transferring the files.

I guess I could do a backend-specific command which would effectively run Extract on all the files, if you wanted to be efficient about it. So: an `rclone backend extract :archive:drive:my.tar.gz destination:whatever` command.

I'm attracted to the stargz format as it would work well for reading and writing with the rclone backend, so from that point of view it would be useful if it were part of Archiver.
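
Writing looks simple too; something like this sketch (tarToStargz is a made-up helper) could convert a plain tar stream on the way through:

```go
import (
	"io"

	"github.com/google/crfs/stargz"
)

// tarToStargz re-encodes a plain (uncompressed) tar stream as stargz.
// AppendTar compresses each file individually and Close writes the index.
func tarToStargz(dst io.Writer, tarStream io.Reader) error {
	w := stargz.NewWriter(dst)
	if err := w.AppendTar(tarStream); err != nil {
		return err
	}
	return w.Close()
}
```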

If you want, I can have a go at a PR.

mholt commented on June 7, 2024

Sure, I'd be curious what a stargz PR would look like if it's not too invasive 🙂 👍
