
Zstream


An Elixir library to read and write ZIP files in a streaming fashion. It can consume data from any stream and write to any stream, with constant memory overhead.

Installation

The package can be installed by adding :zstream to your list of dependencies in mix.exs:

def deps do
  [
    {:zstream, "~> 0.6"}
  ]
end

Examples

Zstream.zip([
  Zstream.entry("report.csv", Stream.map(records, &CSV.dump/1)),
  Zstream.entry("catfilm.mp4", File.stream!("/catfilm.mp4", [], 512), coder: Zstream.Coder.Stored)
])
|> Stream.into(File.stream!("/archive.zip"))
|> Stream.run()

File.stream!("archive.zip", [], 512)
|> Zstream.unzip()
|> Enum.reduce(%{}, fn
  {:entry, %Zstream.Entry{name: _file_name} = _entry}, state -> state
  {:data, :eof}, state -> state
  {:data, _data}, state -> state
end)
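The unzip callbacks above discard the data as they go. As a sketch of actually collecting each entry's bytes (the accumulator shape and in-memory buffering are illustrative, not part of the library's API):

```elixir
# Collect entry contents into %{name => binary}. :current tracks the
# entry being streamed; everything is buffered in memory, so this is
# only suitable for small archives.
File.stream!("archive.zip", [], 512)
|> Zstream.unzip()
|> Enum.reduce(%{current: nil, files: %{}}, fn
  {:entry, %Zstream.Entry{name: name}}, state ->
    %{state | current: name, files: Map.put(state.files, name, [])}

  {:data, :eof}, state ->
    %{state | current: nil}

  {:data, data}, state ->
    update_in(state.files[state.current], &[&1, data])
end)
|> Map.fetch!(:files)
|> Map.new(fn {name, iodata} -> {name, IO.iodata_to_binary(iodata)} end)
```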

Features

zip

  • compression (deflate, stored)
  • encryption (traditional)
  • zip64

unzip

  • compression (deflate, stored)
  • zip64

License

Copyright (c) 2017 Anantha Kumaran

This library is MIT licensed. See the LICENSE for details.

zstream's People

Contributors

ananthakumaran, geofflane, kianmeng, magnetised, mwhitworth, petelacey, ruslandoga, subsetpark


zstream's Issues

Problem with 4Gb+ archives

We want to stream many files that might result in archive size greater than 2^32 bytes. Here is how the issue can be reproduced:

  1. Generate a few files that sum up to more than 4 GiB (4,096 MB) in total:
$ mkfile 10m 10m.pdf # generates a 10 MB file on macOS
  2. Zip all of them into an archive. The small files must come last, so that their offsets in the central directory are above 2^32:
Zstream.zip([
  Zstream.entry("test/825m.pdf", File.stream!("/tmp/source/825m.pdf", [], 512), coder: Zstream.Coder.Stored),
  Zstream.entry("test/821m.pdf", File.stream!("/tmp/source/821m.pdf", [], 512), coder: Zstream.Coder.Stored),
  Zstream.entry("test/820m.pdf", File.stream!("/tmp/source/820m.pdf", [], 512), coder: Zstream.Coder.Stored),
  Zstream.entry("test/801m.pdf", File.stream!("/tmp/source/801m.pdf", [], 512), coder: Zstream.Coder.Stored),
  Zstream.entry("test/800m.pdf", File.stream!("/tmp/source/800m.pdf", [], 512), coder: Zstream.Coder.Stored),
  Zstream.entry("test/100m.pdf", File.stream!("/tmp/source/100m.pdf", [], 512), coder: Zstream.Coder.Stored),
  Zstream.entry("test/50m.pdf", File.stream!("/tmp/source/50m.pdf", [], 512), coder: Zstream.Coder.Stored)
])
|> Stream.into(File.stream!("/tmp/archive.zip"))
|> Stream.run
  3. zipinfo shows an error:
$ zipinfo archive.zip
Archive:  archive.zip
Zip file size: 4421845870 bytes, number of entries: 7
warning [archive.zip]:  4294967296 extra bytes at beginning or within zipfile
  (attempting to process anyway)
-rw-r--r--  5.2 fat 865075200 bl stor 22-Nov-29 09:30 test/825m.pdf
-rw-r--r--  5.2 fat 860880896 bl stor 22-Nov-29 09:30 test/821m.pdf
-rw-r--r--  5.2 fat 859832320 bl stor 22-Nov-29 09:30 test/820m.pdf
-rw-r--r--  5.2 fat 839909376 bl stor 22-Nov-29 09:30 test/801m.pdf
-rw-r--r--  5.2 fat 838860800 bl stor 22-Nov-29 09:30 test/800m.pdf
-rw-r--r--  5.2 fat 104857600 bl stor 22-Nov-29 09:30 test/100m.pdf
-rw-r--r--  5.2 fat 52428800 bl stor 22-Nov-29 09:30 test/50m.pdf
7 files, 4421844992 bytes uncompressed, 4421844992 bytes compressed:  0.0%

macOS doesn't unzip the file

(screenshot attached: SCR-20221129-fgx)

unzip succeeds, but shows some offset warnings

$ unzip archive.zip
Archive:  archive.zip
Created by Zstream
warning [archive.zip]:  4294967296 extra bytes at beginning or within zipfile
  (attempting to process anyway)
file #1:  bad zipfile offset (local header sig):  4294967296
  (attempting to re-compensate)
 extracting: test/825m.pdf
 extracting: test/821m.pdf
 extracting: test/820m.pdf
 extracting: test/801m.pdf
 extracting: test/800m.pdf
 extracting: test/100m.pdf
file #7:  bad zipfile offset (local header sig):  74449256
  (attempting to re-compensate)
 extracting: test/50m.pdf

Something seems to be wrong with the 64-bit offsets. Do you have an idea how it can be fixed?
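A workaround sketch, assuming the `zip64: true` option that appears in the zip64 issue further down this page is available in your Zstream version (`entries` stands in for the entry list; verify against the Zstream docs):

```elixir
# Opt in to zip64 records up front so sizes and offsets >= 2^32
# are representable in the central directory.
Zstream.zip(entries, zip64: true)
|> Stream.into(File.stream!("/tmp/archive.zip"))
|> Stream.run()
```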

cannot find header on valid zip file

demo.docx

Docx files are zip files. This docx file from docxelixir does not parse with zstream, but does parse with Erlang's :zip and the Linux unzip command.

** (Zstream.Unzip.Error) Invalid zip file, could not find any signature header
     code: result = Docxelixir.read_paragraphs('samples/demo.docx')
     stacktrace:
       (zstream 0.6.3) lib/zstream/unzip.ex:209: Zstream.Unzip.next_header/2
       (zstream 0.6.3) lib/zstream/unzip.ex:199: Zstream.Unzip.file_data/2
       (zstream 0.6.3) lib/zstream/unzip.ex:180: Zstream.Unzip.filename_extra_field/2
       (elixir 1.14.0) lib/stream.ex:989: Stream.do_transform_user/6
       (elixir 1.14.0) lib/stream.ex:942: Stream.do_transform/5
       (elixir 1.14.0) lib/enum.ex:4307: Enum.reduce/3
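Since the file does open with Erlang's :zip, a fallback sketch using the OTP API (this loads the archive fully into memory, unlike Zstream's streaming approach):

```elixir
# :zip.extract/2 with the :memory option returns
# {:ok, [{name_charlist, binary}]} without writing to the filesystem.
{:ok, entries} = :zip.extract(String.to_charlist("samples/demo.docx"), [:memory])

# Entry names come back as charlists; convert for convenience.
Enum.map(entries, fn {name, _bin} -> List.to_string(name) end)
```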

Zip not fully formed

I'm using your library to stream files from a Phoenix endpoint. The endpoint looks something like this:

def download_file_list(conn, paths, uri, basepath) do
  conn =
    conn
    |> put_resp_content_type(Plug.MIME.type("zip"))
    |> put_resp_header(
      "content-disposition",
      "attachment; filename=\"#{Path.basename(basepath)}.zip\""
    )
    |> send_chunked(200)

  Enum.map(paths, fn path ->
    path_size = byte_size(uri)

    rel_path =
      case String.starts_with?(path, uri) do
        true ->
          <<_::binary-size(path_size), trailing::binary>> = path
          trailing

        _ ->
          path
      end

    stream = File.stream!(path, [], @chunk_size)
    Zstream.entry(rel_path, stream, coder: Zstream.Coder.Stored)
  end)
  |> Zstream.zip()
  |> Enum.reduce_while(conn, fn chunk, conn ->
    case Plug.Conn.chunk(conn, chunk) do
      {:ok, conn} ->
        {:cont, conn}

      {:error, :closed} ->
        {:halt, conn}
    end
  end)
end

However, when downloading the zip, while it opens fine on Linux or with WinZip etc., it fails to open with Windows' or macOS's native archive tools.

Is this something that is known? Am I missing something? I can upload a zip somewhere if that will help.

Special character strings

Hi.

I tried to compress a file using the lib. Compression succeeded, but the file name does not remain the same, as if an encoding error had occurred.

Here is the code:

[Zstream.entry(file_name_with_extension, File.stream!(file_path, [:utf8]))]
|> Zstream.zip()
|> Stream.into(File.stream!(file_path_zip))
|> Stream.run()

The "file_name_with_extension" value is "Relatório.csv", but when the file is built the name inside the zip is "Relat├│rio.csv".
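For what it's worth, this mangling is the classic symptom of UTF-8 file-name bytes being decoded as CP437 (the historical ZIP default when the language-encoding flag is not honored by the extractor): "ó" is the two bytes 0xC3 0xB3 in UTF-8, and CP437 maps 0xC3 to "├" and 0xB3 to "│". A quick check of the byte values:

```elixir
# "ó" occupies two bytes in UTF-8; an extractor that assumes CP437
# renders those bytes as box-drawing characters, giving "Relat├│rio.csv".
<<b1, b2>> = "ó"
{b1, b2}
# => {195, 179}   (0xC3, 0xB3)
```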

Problem with zip64 on native Xiaomi archiver utility

We are streaming lots of archives with zstream. It works very well, except with the native Xiaomi archiver utility, which fails to unzip a zip64 archive. Our first thought was that this is a bug in that utility, but it handles zip64 archives from another streaming solution, mod_zip, with no issues.

Another hint that something might be wrong with the zip64 format is that the zipdetails utility (macOS/Linux) fails to display zstream's zip64 details. Here is an example script and the resulting files.

Zstream.zip([
  Zstream.entry("test/10.txt", File.stream!("10.txt", [], 512), coder: Zstream.Coder.Stored)
])
|> Stream.into(File.stream!("/tmp/test.zip"))
|> Stream.run

Zstream.zip([
  Zstream.entry("test/10.txt", File.stream!("10.txt", [], 512), coder: Zstream.Coder.Stored)
], zip64: true)
|> Stream.into(File.stream!("/tmp/test_zip64.zip"))
|> Stream.run

$ zipdetails /tmp/test.zip

0000 LOCAL HEADER #1       04034B50
0004 Extract Zip Spec      14 '2.0'
0005 Extract OS            00 'MS-DOS'
0006 General Purpose Flag  0808
     [Bit  3]              1 'Streamed'
     [Bit 11]              1 'Language Encoding'
0008 Compression Method    0000 'Stored'
000A Last Mod Time         56385EDD 'Tue Jan 24 11:54:58 2023'
000E CRC                   00000000
0012 Compressed Length     00000000
0016 Uncompressed Length   00000000
001A Filename Length       000B
001C Extra Length          0000
001E Filename              'test/10.txt'
0029 PAYLOAD               123456789.

0033 STREAMING DATA HEADER 08074B50
0037 CRC                   E0117757
003B Compressed Length     0000000A
003F Uncompressed Length   0000000A

0043 CENTRAL HEADER #1     02014B50
0047 Created Zip Spec      34 '5.2'
0048 Created OS            00 'MS-DOS'
0049 Extract Zip Spec      14 '2.0'
004A Extract OS            00 'MS-DOS'
004B General Purpose Flag  0808
     [Bit  3]              1 'Streamed'
     [Bit 11]              1 'Language Encoding'
004D Compression Method    0000 'Stored'
004F Last Mod Time         56385EDD 'Tue Jan 24 11:54:58 2023'
0053 CRC                   E0117757
0057 Compressed Length     0000000A
005B Uncompressed Length   0000000A
005F Filename Length       000B
0061 Extra Length          0000
0063 Comment Length        0000
0065 Disk Start            0000
0067 Int File Attributes   0000
     [Bit 0]               0 'Binary Data'
0069 Ext File Attributes   81A40000
006D Local Header Offset   00000000
0071 Filename              'test/10.txt'

007C END CENTRAL HEADER    06054B50
0080 Number of this disk   0000
0082 Central Dir Disk no   0000
0084 Entries in this disk  0001
0086 Total Entries         0001
0088 Size of Central Dir   00000039
008C Offset to Central Dir 00000043
0090 Comment Length        0012
0092 Comment               'Created by Zstream'
Done
$ zipdetails /tmp/test_zip64.zip
0000 PREFIX DATA
Done

10.txt
test.zip
test_zip64.zip

Resumable download

Hi Anantha, I have an interesting question. Let's imagine the following case:

We have hundreds of files that we stream into a zip-archive from some cloud locations. We know the size of each file and its CRC upfront. File order within the stream is always the same. Connection interrupts and browser wants to resume the download sending us the range header. So, we know that browser has received the first X bytes of the stream. (Very theoretically) we can calculate all the file/header/descriptor offsets and continue sending bytes to the stream starting from a particular byte of a particular file.

Do you think it's doable? What would it take to implement it?
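Sketching the offset arithmetic, under the assumption that each stored entry is laid out as a 30-byte local file header plus the file name, the file data, and a 16-byte data descriptor (field sizes per the ZIP application note; `Offsets` and the `{name, size}` tuple shape are hypothetical, not part of Zstream):

```elixir
# Hypothetical helper: compute the starting byte offset of each entry's
# local header, assuming stored (uncompressed) entries written with
# streaming data descriptors. Compressed entries would make the sizes
# unpredictable, so this only works with Zstream.Coder.Stored.
defmodule Offsets do
  @local_header 30
  @data_descriptor 16

  def local_header_offsets(entries) do
    entries
    |> Enum.map_reduce(0, fn {name, size}, offset ->
      entry_bytes = @local_header + byte_size(name) + size + @data_descriptor
      {{name, offset}, offset + entry_bytes}
    end)
    |> elem(0)
  end
end
```

Given such a table, a Range request could in principle be mapped back to a particular entry and a byte position within it, and the stream resumed from there.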

Files with data descriptor record are not supported

I create a zip file exactly like in the example:

Zstream.zip([Zstream.entry("some.pdf", File.stream!("some.pdf", [], 512), coder: Zstream.Coder.Deflate)])
|> Stream.into(File.stream!("archive.zip"))
|> Stream.run()

The archive.zip file can be opened and extracted by the Ubuntu Archive Manager.

I then read the zip as shown in the example:

File.stream!("archive.zip", [], 512)
|> Zstream.unzip()
|> Enum.reduce(%{}, fn
  {:entry, %Zstream.Entry{name: _file_name} = _entry}, state -> state
  {:data, :eof}, state -> state
  {:data, _data}, state -> state
end)

It gives this error:

** (Zstream.Unzip.Error) Zip files with data descriptor record are not supported
    (zstream 0.6.4) lib/zstream/unzip.ex:140: Zstream.Unzip.local_file_header/2

If I debug, this is what the local header looks like:

%Zstream.Unzip.LocalHeader{
  compressed_size: 0,
  compression_method: 8,
  crc32: 0,
  extra_field: nil,
  extra_field_length: 0,
  extras: nil,
  file_name: nil,
  file_name_length: 8,
  general_purpose_bit_flag: 2056,
  last_modified_file_date: 22180,
  last_modified_file_time: 1304,
  uncompressed_size: 0,
  version_need_to_extract: 20
}

If I use the Ubuntu Archive Manager to create a zip file of the exact same PDF, then Zstream is able to read that zip file. The problem only occurs with zip files created by Zstream. Since it's the most basic example, I guess it's an OTP or Elixir version problem?
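The flag value in that header also explains the behavior: 2056 (0x808) has bit 3 set, meaning the writer deferred the sizes and CRC to a trailing data descriptor, which is exactly what Zstream.unzip rejects. This matches the zipdetails output elsewhere on this page ("[Bit 3] 1 'Streamed'"). A quick sanity check in plain Elixir:

```elixir
import Bitwise

# general_purpose_bit_flag 2056 = 0x808
flag = 2056

# bit 3: CRC and sizes follow the data in a data descriptor ("streamed")
band(flag, 0x0008) != 0
# => true

# bit 11: file names are UTF-8 ("language encoding")
band(flag, 0x0800) != 0
# => true
```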
