yauzl

yet another unzip library for node. For zipping, see yazl.

Design principles:

  • Follow the spec. Don't scan for local file headers. Read the central directory for file metadata. (see No Streaming Unzip API).
  • Don't block the JavaScript thread. Use and provide async APIs.
  • Keep memory usage under control. Don't attempt to buffer entire files in RAM at once.
  • Never crash (if used properly). Don't let malformed zip files bring down client applications that are trying to catch errors.
  • Catch unsafe file names. See validateFileName().

Usage

var yauzl = require("yauzl");

yauzl.open("path/to/file.zip", {lazyEntries: true}, function(err, zipfile) {
  if (err) throw err;
  zipfile.readEntry();
  zipfile.on("entry", function(entry) {
    if (/\/$/.test(entry.fileName)) {
      // Directory file names end with '/'.
      // Note that entries for directories themselves are optional.
      // An entry's fileName implicitly requires its parent directories to exist.
      zipfile.readEntry();
    } else {
      // file entry
      zipfile.openReadStream(entry, function(err, readStream) {
        if (err) throw err;
        readStream.on("end", function() {
          zipfile.readEntry();
        });
        readStream.pipe(somewhere);
      });
    }
  });
});

See also examples/ for more usage examples.

API

The default for every optional callback parameter is:

function defaultCallback(err) {
  if (err) throw err;
}

open(path, [options], [callback])

Calls fs.open(path, "r") and reads the fd effectively the same as fromFd() would.

options may be omitted or null. The defaults are {autoClose: true, lazyEntries: false, decodeStrings: true, validateEntrySizes: true, strictFileNames: false}.

autoClose is effectively equivalent to:

zipfile.once("end", function() {
  zipfile.close();
});

lazyEntries indicates that entries should be read only when readEntry() is called. If lazyEntries is false, entry events will be emitted as fast as possible to allow pipe()ing file data from all entries in parallel. This is not recommended, as it can lead to out of control memory usage for zip files with many entries. See issue #22. If lazyEntries is true, an entry or end event will be emitted in response to each call to readEntry(). This allows processing of one entry at a time, and will keep memory usage under control for zip files with many entries.

decodeStrings is the default and causes yauzl to decode strings with CP437 or UTF-8 as required by the spec. The exact effects of turning this option off are:

  • zipfile.comment, entry.fileName, and entry.fileComment will be Buffer objects instead of Strings.
  • Any Info-ZIP Unicode Path Extra Field will be ignored. See extraFields.
  • Automatic file name validation will not be performed. See validateFileName().

validateEntrySizes is the default and ensures that an entry's reported uncompressed size matches its actual uncompressed size. This check happens as early as possible, which is either before emitting each "entry" event (for entries with no compression), or during the readStream piping after calling openReadStream(). See openReadStream() for more information on defending against zip bomb attacks.

When strictFileNames is false (the default) and decodeStrings is true, all backslash (\) characters in each entry.fileName are replaced with forward slashes (/). The spec forbids file names with backslashes, but Microsoft's System.IO.Compression.ZipFile class in .NET versions 4.5.0 until 4.6.1 creates non-conformant zipfiles with backslashes in file names. strictFileNames is false by default so that clients can read these non-conformant zipfiles without knowing about this Microsoft-specific bug. When strictFileNames is true and decodeStrings is true, entries with backslashes in their file names will result in an error. See validateFileName(). When decodeStrings is false, strictFileNames has no effect.

The callback is given the arguments (err, zipfile). An err is provided if the End of Central Directory Record cannot be found, or if its metadata appears malformed. This kind of error usually indicates that this is not a zip file. Otherwise, zipfile is an instance of ZipFile.

fromFd(fd, [options], [callback])

Reads from the fd, which is presumed to be an open .zip file. Note that random access is required by the zip file specification, so the fd cannot be an open socket or any other fd that does not support random access.

options may be omitted or null. The defaults are {autoClose: false, lazyEntries: false, decodeStrings: true, validateEntrySizes: true, strictFileNames: false}.

See open() for the meaning of the options and callback.

fromBuffer(buffer, [options], [callback])

Like fromFd(), but reads from a RAM buffer instead of an open file. buffer is a Buffer.

If a ZipFile is acquired from this method, it will never emit the close event, and calling close() is not necessary.

options may be omitted or null. The defaults are {lazyEntries: false, decodeStrings: true, validateEntrySizes: true, strictFileNames: false}.

See open() for the meaning of the options and callback. The autoClose option is ignored for this method.

fromRandomAccessReader(reader, totalSize, [options], [callback])

This method of reading a zip file allows clients to implement their own back-end file system. For example, a client might translate read calls into network requests.

The reader parameter must be of a type that is a subclass of RandomAccessReader that implements the required methods. The totalSize is a Number and indicates the total file size of the zip file.

options may be omitted or null. The defaults are {autoClose: true, lazyEntries: false, decodeStrings: true, validateEntrySizes: true, strictFileNames: false}.

See open() for the meaning of the options and callback.

dosDateTimeToDate(date, time)

Converts MS-DOS date and time data into a JavaScript Date object. Each parameter is a Number treated as an unsigned 16-bit integer. Note that this format does not support timezones. The returned Date object will be constructed using the local timezone.
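As a rough illustration of what this conversion involves, the sketch below unpacks the two 16-bit values by hand using the standard MS-DOS bit layout (seconds are stored divided by two). The name decodeDosDateTime is made up for this example and is not part of yauzl; in real code, call dosDateTimeToDate() instead.

```javascript
// Hypothetical helper, not part of yauzl: unpack MS-DOS date/time fields.
function decodeDosDateTime(date, time) {
  var day = date & 0x1f;                   // bits 0-4: day of month, 1-31
  var month = (date >> 5) & 0xf;           // bits 5-8: month, 1-12
  var year = ((date >> 9) & 0x7f) + 1980;  // bits 9-15: years since 1980
  var second = (time & 0x1f) * 2;          // bits 0-4: seconds divided by 2
  var minute = (time >> 5) & 0x3f;         // bits 5-10: minute
  var hour = (time >> 11) & 0x1f;          // bits 11-15: hour
  // JavaScript months are 0-based; the Date is built in the local timezone.
  return new Date(year, month - 1, day, hour, minute, second);
}

// 2020-06-15 13:45:30, packed by hand for the example.
var exampleDate = ((2020 - 1980) << 9) | (6 << 5) | 15;
var exampleTime = (13 << 11) | (45 << 5) | (30 / 2);
var decoded = decodeDosDateTime(exampleDate, exampleTime);
```

Because seconds are stored in two-second units, odd seconds cannot be represented in this format.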

In order to interpret the parameters in UTC time instead of local time, you can convert with the following snippet:

var timestampInterpretedAsLocal = yauzl.dosDateTimeToDate(date, time); // or entry.getLastModDate()
var timestampInterpretedAsUTCInstead = new Date(
    timestampInterpretedAsLocal.getTime() -
    timestampInterpretedAsLocal.getTimezoneOffset() * 60 * 1000
);

Note that there is an ECMAScript proposal to add better timezone support to JavaScript called the Temporal API. Last I checked, it is at stage 3. https://github.com/tc39/proposal-temporal

Once that new API is available and stable, better timezone handling should be possible here somehow. Feel free to open a feature request against this library when the time comes.

getFileNameLowLevel(generalPurposeBitFlag, fileNameBuffer, extraFields, strictFileNames)

If you are setting decodeStrings to false, then this function can be used to decode the file name yourself. This function is effectively used internally by yauzl to populate the entry.fileName field when decodeStrings is true.

WARNING: This method of getting the file name bypasses the security checks in validateFileName(). You should call that function yourself to be sure to guard against malicious file paths.

generalPurposeBitFlag can be found on an Entry or LocalFileHeader. Only General Purpose Bit 11 is used, and only when an Info-ZIP Unicode Path Extra Field cannot be found in extraFields.

fileNameBuffer is a Buffer representing the file name field of the entry. This is entry.fileNameRaw or localFileHeader.fileName.

extraFields is the parsed extra fields array from entry.extraFields or parseExtraFields().

strictFileNames is a boolean, the same as the option of the same name in open(). When false, backslash characters (\) will be replaced with forward slash characters (/).

This function always returns a string, although it may not be a valid file name. See validateFileName().
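A minimal sketch of how the pieces above might fit together when decodeStrings is false. The helper name decodeEntryFileName is hypothetical, and the yauzl module is passed in as a parameter purely to keep the sketch self-contained; it only uses getFileNameLowLevel() and validateFileName() as documented here.

```javascript
// Sketch: decode and validate a file name by hand when decodeStrings is false.
// `yauzl` is whatever module instance the caller already has (assumption:
// it exposes getFileNameLowLevel() and validateFileName() as documented).
function decodeEntryFileName(yauzl, entry, strictFileNames) {
  var fileName = yauzl.getFileNameLowLevel(
    entry.generalPurposeBitFlag,
    entry.fileNameRaw,
    entry.extraFields,
    strictFileNames
  );
  // getFileNameLowLevel() skips the safety checks, so run them explicitly.
  var errorMessage = yauzl.validateFileName(fileName);
  if (errorMessage != null) throw new Error(errorMessage);
  return fileName;
}
```

Inside an "entry" listener this could be called as decodeEntryFileName(yauzl, entry, false), with the thrown error caught or converted to whatever error handling the client uses.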

validateFileName(fileName)

Returns null or a String error message depending on the validity of fileName. If fileName starts with "/" or /[A-Za-z]:\// or if it contains ".." path segments or "\\", this function returns an error message appropriate for use like this:

var errorMessage = yauzl.validateFileName(fileName);
if (errorMessage != null) throw new Error(errorMessage);

This function is automatically run for each entry, as long as decodeStrings is true. See open(), strictFileNames, and Event: "entry" for more information.
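To make the rules above concrete, here is an illustrative restatement of them as a standalone function. This is a sketch written from the description in this section, not yauzl's actual source; the name checkFileName and the error message wording are assumptions.

```javascript
// Illustrative restatement of the documented rules; yauzl's real
// implementation and messages may differ.
function checkFileName(fileName) {
  if (fileName.indexOf("\\") !== -1) {
    return "backslash in file name: " + fileName;
  }
  if (/^\//.test(fileName) || /^[A-Za-z]:\//.test(fileName)) {
    return "absolute path: " + fileName;
  }
  if (fileName.split("/").indexOf("..") !== -1) {
    return "'..' path segment: " + fileName;
  }
  return null; // no problem found
}
```

Note that only whole ".." path segments are rejected, so a name such as "a..b/c" passes this check.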

parseExtraFields(extraFieldBuffer)

This function is used internally by yauzl to compute entry.extraFields. It is exported in case you want to call it on localFileHeader.extraField.

extraFieldBuffer is a Buffer, such as localFileHeader.extraField. Returns an Array with each item in the form {id: id, data: data}, where id is a Number and data is a Buffer. Throws an Error if the data encodes an item with a size that exceeds the bounds of the buffer.

You may want to surround calls to this function with try { ... } catch (err) { ... } to handle the error.
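For readers curious about the layout being parsed, the sketch below walks an extra field buffer by hand. Each record in the zip format is a 2-byte id, a 2-byte data size, and then that many bytes of data, all little-endian. The function name parseExtraFieldsSketch is hypothetical and this is not yauzl's implementation; prefer parseExtraFields() in real code.

```javascript
// Sketch of the extra field record layout: 2-byte id, 2-byte size, then data.
// Hypothetical helper, not yauzl's implementation.
function parseExtraFieldsSketch(extraFieldBuffer) {
  var fields = [];
  var i = 0;
  while (i + 4 <= extraFieldBuffer.length) {
    var id = extraFieldBuffer.readUInt16LE(i);
    var size = extraFieldBuffer.readUInt16LE(i + 2);
    var dataStart = i + 4;
    var dataEnd = dataStart + size;
    if (dataEnd > extraFieldBuffer.length) {
      throw new Error("extra field size exceeds the bounds of the buffer");
    }
    fields.push({id: id, data: extraFieldBuffer.subarray(dataStart, dataEnd)});
    i = dataEnd;
  }
  return fields;
}

// One record: id 0x7075 with three bytes of data.
var sampleFields = parseExtraFieldsSketch(
  Buffer.from([0x75, 0x70, 0x03, 0x00, 0x01, 0x02, 0x03])
);
```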

Class: ZipFile

The constructor for the class is not part of the public API. Use open(), fromFd(), fromBuffer(), or fromRandomAccessReader() instead.

Event: "entry"

Callback gets (entry), which is an Entry. See open() and readEntry() for when this event is emitted.

If decodeStrings is true, entries emitted via this event have already passed file name validation. See validateFileName() and open() for more information.

If validateEntrySizes is true and this entry's compressionMethod is 0 (stored without compression), this entry has already passed entry size validation. See open() for more information.

Event: "end"

Emitted after the last entry event has been emitted. See open() and readEntry() for more info on when this event is emitted.

Event: "close"

Emitted after the fd is actually closed. This is after calling close() (or after the end event when autoClose is true), and after all stream pipelines created from openReadStream() have finished reading data from the fd.

If this ZipFile was acquired from fromRandomAccessReader(), the "fd" in the previous paragraph refers to the RandomAccessReader implemented by the client.

If this ZipFile was acquired from fromBuffer(), this event is never emitted.

Event: "error"

Emitted in the case of errors with reading the zip file. (Note that other errors can be emitted from the streams created from openReadStream() as well.) After this event has been emitted, no further entry, end, or error events will be emitted, but the close event may still be emitted.

readEntry()

Causes this ZipFile to emit an entry or end event (or an error event). This method must only be called when this ZipFile was created with the lazyEntries option set to true (see open()). When this ZipFile was created with the lazyEntries option set to true, entry and end events are only ever emitted in response to this method call.

The event that is emitted in response to this method will not be emitted until after this method has returned, so it is safe to call this method before attaching event listeners.

After calling this method, calling this method again before the response event has been emitted will cause undefined behavior. Calling this method after the end event has been emitted will cause undefined behavior. Calling this method after calling close() will cause undefined behavior.

openReadStream(entry, [options], callback)

entry must be an Entry object from this ZipFile. callback gets (err, readStream), where readStream is a Readable Stream that provides the file data for this entry. If this zipfile is already closed (see close()), the callback will receive an err.

options may be omitted or null, and has the following defaults:

{
  decompress: entry.isCompressed() ? true : null,
  decrypt: null,
  start: 0,                  // actually the default is null, see below
  end: entry.compressedSize, // actually the default is null, see below
}

If the entry is compressed (with a supported compression method), and the decompress option is true (or omitted), the read stream provides the decompressed data. Omitting the decompress option is what most clients should do.

The decompress option must be null (or omitted) when the entry is not compressed (see isCompressed()), and either true (or omitted) or false when the entry is compressed. Specifying decompress: false for a compressed entry causes the read stream to provide the raw compressed file data without going through a zlib inflate transform.

If the entry is encrypted (see isEncrypted()), clients may want to avoid calling openReadStream() on the entry entirely. Alternatively, clients may call openReadStream() for encrypted entries and specify decrypt: false. If the entry is also compressed, clients must also specify decompress: false. Specifying decrypt: false for an encrypted entry causes the read stream to provide the raw, still-encrypted file data. (This data includes the 12-byte header described in the spec.)

The decrypt option must be null (or omitted) for non-encrypted entries, and false for encrypted entries. Omitting the decrypt option (or specifying it as null) for an encrypted entry will result in the callback receiving an err. This default behavior is so that clients not accounting for encrypted files aren't surprised by bogus file data.

The start (inclusive) and end (exclusive) options are byte offsets into this entry's file data, and can be used to obtain part of an entry's file data rather than the whole thing. If either of these options is specified and non-null, then the above options must be used to obtain the file's raw data. Specifying {start: 0, end: entry.compressedSize} will result in the complete file, which is effectively the default values for these options, but note that unlike omitting the options, when you specify start or end as any non-null value, the above requirement is still enforced that you must also pass the appropriate options to get the file's raw data.
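One way to keep these rules straight is to derive the options from the entry itself. The helper below is a hypothetical sketch (rawRangeOptions is not a yauzl function) that follows the requirements stated above: decompress and decrypt are set to false only when the entry is compressed or encrypted, and left null otherwise.

```javascript
// Hypothetical helper: build openReadStream() options for reading a raw byte
// range of an entry's file data, following the rules described above.
function rawRangeOptions(entry, start, end) {
  return {
    // Compressed entries must opt out of inflation; others must leave it null.
    decompress: entry.isCompressed() ? false : null,
    // Encrypted entries must opt out of decryption; others must leave it null.
    decrypt: entry.isEncrypted() ? false : null,
    start: start, // inclusive byte offset
    end: end,     // exclusive byte offset
  };
}
```

It might be used as zipfile.openReadStream(entry, rawRangeOptions(entry, 0, 100), callback) to read the first 100 raw bytes of an entry's data.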

It's possible for the readStream provided to the callback to emit errors for several reasons. For example, if zlib cannot decompress the data, the zlib error will be emitted from the readStream. Two more error cases (when validateEntrySizes is true) are if the decompressed data has too many or too few actual bytes compared to the reported byte count from the entry's uncompressedSize field. yauzl notices this false information and emits an error from the readStream after some number of bytes have already been piped through the stream.

This check allows clients to trust the uncompressedSize field in Entry objects. Guarding against zip bomb attacks can be accomplished by doing some heuristic checks on the size metadata and then watching out for the above errors. Such heuristics are outside the scope of this library, but enforcing the uncompressedSize is implemented here as a security feature.
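As one example of the kind of heuristic a client might apply, the sketch below flags entries whose reported sizes exceed a fixed budget or an implausible compression ratio. The function name and both limits are purely illustrative assumptions, not recommendations and not part of this library.

```javascript
// Hypothetical policy check; the limits are illustrative only.
var MAX_ENTRY_BYTES = 100 * 1024 * 1024; // 100 MiB per entry
var MAX_RATIO = 200;                     // uncompressed size / compressed size

function looksSuspicious(entry) {
  if (entry.uncompressedSize > MAX_ENTRY_BYTES) return true;
  // Zero compressed bytes cannot expand into a non-empty file.
  if (entry.compressedSize === 0) return entry.uncompressedSize > 0;
  return entry.uncompressedSize / entry.compressedSize > MAX_RATIO;
}
```

A client could run this check in its "entry" listener and skip or reject flagged entries before calling openReadStream(). With validateEntrySizes left on, an entry that lies about its uncompressedSize still produces a stream error.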

It is possible to destroy the readStream before it has piped all of its data. To do this, call readStream.destroy(). You must unpipe() the readStream from any destination before calling readStream.destroy(). If this zipfile was created using fromRandomAccessReader(), the RandomAccessReader implementation must provide readable streams that implement a ._destroy() method according to https://nodejs.org/api/stream.html#writable_destroyerr-callback (see randomAccessReader._readStreamForRange()) in order for calls to readStream.destroy() to work in this context.

readLocalFileHeader(entry, [options], callback)

This is a low-level function you probably don't need to call. The intended use case is either preparing to call openReadStreamLowLevel() or simply examining the content of the local file header out of curiosity or for debugging zip file structure issues.

entry is an entry obtained from Event: "entry". An entry in this library is a file's metadata from a Central Directory Header, and this function gives the corresponding redundant data in a Local File Header.

options may be omitted or null, and has the following defaults:

{
  minimal: false,
}

If minimal is false (or omitted or null), the callback receives a full LocalFileHeader. If minimal is true, the callback receives an object with a single property and no prototype {fileDataStart: fileDataStart}. For typical zipfile reading use cases, this field is the only one you need, and yauzl internally effectively uses the {minimal: true} option as part of openReadStream().

The callback receives (err, localFileHeaderOrAnObjectWithJustOneFieldDependingOnTheMinimalOption), where the type of the second parameter is described in the above discussion of the minimal option.

openReadStreamLowLevel(fileDataStart, compressedSize, relativeStart, relativeEnd, decompress, uncompressedSize, callback)

This is a low-level function available for advanced use cases. You probably want openReadStream() instead.

The intended use case for this function is calling readEntry() and readLocalFileHeader() with {minimal: true} first, and then opening the read stream at a later time, possibly after closing and reopening the entire zipfile, possibly even in a different process. The parameters are all integers and booleans, which are friendly to serialization.

  • fileDataStart - from localFileHeader.fileDataStart
  • compressedSize - from entry.compressedSize
  • relativeStart - the resolved value of options.start from openReadStream(). Must be a non-negative integer, not null. Typically 0 to start at the beginning of the data.
  • relativeEnd - the resolved value of options.end from openReadStream(). Must be a non-negative integer, not null. Typically entry.compressedSize to include all the data.
  • decompress - boolean indicating whether the data should be piped through a zlib inflate stream.
  • uncompressedSize - from entry.uncompressedSize. Only used when validateEntrySizes is true. If validateEntrySizes is false, this value is ignored, but must still be present, not omitted, in the arguments; you have to give it some value, even if it's null.
  • callback - receives (err, readStream), the same as for openReadStream()

This low-level function does not read any metadata from the underlying storage before opening the read stream. This is both a performance feature and a safety hazard. None of the integer parameters are bounds checked. None of the validation from openReadStream() with respect to compression and encryption is done here either. Only the bounds checks from validateEntrySizes are done, because that is part of processing the stream data.
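The intended workflow described above could be sketched roughly as follows. Both helper names are hypothetical. The sketch assumes a non-encrypted entry and reads the whole data range, and it performs none of the validation that openReadStream() would do.

```javascript
// Sketch: capture everything openReadStreamLowLevel() needs as plain,
// JSON-serializable values. Helper names are hypothetical.
function captureStreamParams(zipfile, entry, callback) {
  zipfile.readLocalFileHeader(entry, {minimal: true}, function(err, header) {
    if (err) return callback(err);
    callback(null, {
      fileDataStart: header.fileDataStart,
      compressedSize: entry.compressedSize,
      relativeStart: 0,
      relativeEnd: entry.compressedSize,
      decompress: entry.isCompressed(),
      uncompressedSize: entry.uncompressedSize,
    });
  });
}

// Later, possibly on a freshly reopened zipfile or in another process:
function openFromParams(zipfile, params, callback) {
  zipfile.openReadStreamLowLevel(
    params.fileDataStart,
    params.compressedSize,
    params.relativeStart,
    params.relativeEnd,
    params.decompress,
    params.uncompressedSize,
    callback
  );
}
```

Since the captured object contains only numbers and a boolean, it survives JSON.stringify() and JSON.parse() unchanged.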

close()

Causes all future calls to openReadStream() to fail, and closes the fd, if any, after all streams created by openReadStream() have emitted their end events.

If the autoClose option is set to true (see open()), this function is effectively called automatically in response to this object's end event.

If the lazyEntries option is set to false (see open()) and this object's end event has not been emitted yet, this function causes undefined behavior. If the lazyEntries option is set to true, you can call this function instead of calling readEntry() to abort reading the entries of a zipfile.

It is safe to call this function multiple times; after the first call, successive calls have no effect. This includes situations where the autoClose option effectively calls this function for you.

If close() is never called, then the zipfile is "kept open". For zipfiles created with fromFd(), this will leave the fd open, which may be desirable. For zipfiles created with open(), this will leave the underlying fd open, thereby "leaking" it, which is probably undesirable. For zipfiles created with fromRandomAccessReader(), the reader's close() method will never be called. For zipfiles created with fromBuffer(), the close() function has no effect whether called or not.

Regardless of how this ZipFile was created, there are no resources other than those listed above that require cleanup from this function. This means it may be desirable to never call close() in some use cases.

isOpen

Boolean. true until close() is called; then it's false.

entryCount

Number. Total number of central directory records.

comment

String. Always decoded with CP437 per the spec.

If decodeStrings is false (see open()), this field is the undecoded Buffer instead of a decoded String.

Class: Entry

Objects of this class represent Central Directory Records. Refer to the zipfile specification for more details about these fields.

These fields are of type Number:

  • versionMadeBy
  • versionNeededToExtract
  • generalPurposeBitFlag
  • compressionMethod
  • lastModFileTime (MS-DOS format, see getLastModDate())
  • lastModFileDate (MS-DOS format, see getLastModDate())
  • crc32
  • compressedSize
  • uncompressedSize
  • fileNameLength (in bytes)
  • extraFieldLength (in bytes)
  • fileCommentLength (in bytes)
  • internalFileAttributes
  • externalFileAttributes
  • relativeOffsetOfLocalHeader

These fields are of type Buffer, and represent variable-length bytes before being processed:

  • fileNameRaw
  • extraFieldRaw
  • fileCommentRaw

There are additional fields described below: fileName, extraFields, fileComment. These are the processed versions of the *Raw fields listed above. See their own sections below. (Note the inconsistency in pluralization of "field" vs "fields" in extraField, extraFields, and extraFieldRaw. Sorry about that.)

The new Entry() constructor is available for clients to call, but it's usually not useful. The constructor takes no parameters and does nothing; no fields will exist.

fileName

String. Following the spec, the bytes for the file name are decoded with UTF-8 if generalPurposeBitFlag & 0x800, otherwise with CP437. Alternatively, this field may be populated from the Info-ZIP Unicode Path Extra Field (see extraFields).

This field is automatically validated by validateFileName() before yauzl emits an "entry" event. If this field would contain unsafe characters, yauzl emits an error instead of an entry.

If decodeStrings is false (see open()), this field is the undecoded Buffer instead of a decoded String. Therefore, generalPurposeBitFlag and any Info-ZIP Unicode Path Extra Field are ignored. Furthermore, no automatic file name validation is performed for this file name.

extraFields

Array with each item in the form {id: id, data: data}, where id is a Number and data is a Buffer.

This library looks for and reads the ZIP64 Extended Information Extra Field (0x0001) in order to support ZIP64 format zip files.

This library also looks for and reads the Info-ZIP Unicode Path Extra Field (0x7075) in order to support some zipfiles that use it instead of General Purpose Bit 11 to convey UTF-8 file names. When the field is identified and verified to be reliable (see the zipfile spec), the file name in this field is stored in the fileName property, and the file name in the central directory record for this entry is ignored. Note that when decodeStrings is false, all Info-ZIP Unicode Path Extra Fields are ignored.

None of the other fields are considered significant by this library. Fields that this library reads are left unaltered in the extraFields array.

fileComment

String decoded with the charset indicated by generalPurposeBitFlag & 0x800 as with the fileName. (The Info-ZIP Unicode Path Extra Field has no effect on the charset used for this field.)

If decodeStrings is false (see open()), this field is the undecoded Buffer instead of a decoded String.

Prior to yauzl version 2.7.0, this field was erroneously documented as comment instead of fileComment. For compatibility with any code that uses the field name comment, yauzl creates an alias field named comment which is identical to fileComment.

getLastModDate()

Effectively implemented as the following. See dosDateTimeToDate().

return dosDateTimeToDate(this.lastModFileDate, this.lastModFileTime);

isEncrypted()

Returns whether this entry is encrypted with "Traditional Encryption". Effectively implemented as:

return (this.generalPurposeBitFlag & 0x1) !== 0;

See openReadStream() for the implications of this value.

Note that "Strong Encryption" is not supported, and will result in an "error" event emitted from the ZipFile.

isCompressed()

Effectively implemented as:

return this.compressionMethod === 8;

See openReadStream() for the implications of this value.

Class: LocalFileHeader

This is a trivial class that has no methods and only the following properties. The constructor is available to call, but it doesn't do anything. See readLocalFileHeader().

See the zipfile spec for what these fields mean.

  • fileDataStart - Number: inferred from fileNameLength, extraFieldLength, and this struct's position in the zipfile.
  • versionNeededToExtract - Number
  • generalPurposeBitFlag - Number
  • compressionMethod - Number
  • lastModFileTime - Number
  • lastModFileDate - Number
  • crc32 - Number
  • compressedSize - Number
  • uncompressedSize - Number
  • fileNameLength - Number
  • extraFieldLength - Number
  • fileName - Buffer
  • extraField - Buffer

Note that unlike Class: Entry, the fileName and extraField are completely unprocessed. This notably lacks Unicode and ZIP64 handling as well as any kind of safety validation on the file name. See also parseExtraFields().

Also note that if your object is missing some of these fields, make sure to read the docs on the minimal option in readLocalFileHeader().

Class: RandomAccessReader

This class is meant to be subclassed by clients and instantiated for the fromRandomAccessReader() function.

An example implementation can be found in test/test.js.

randomAccessReader._readStreamForRange(start, end)

Subclasses must implement this method.

start and end are Numbers and indicate byte offsets from the start of the file. end is exclusive, so _readStreamForRange(0x1000, 0x2000) would indicate to read 0x1000 bytes. end - start will always be at least 1.

This method should return a readable stream which will be pipe()ed into another stream. It is expected that the readable stream will provide data in several chunks if necessary. If the readable stream provides too many or too few bytes, an error will be emitted. (Note that validateEntrySizes has no effect on this check, because this is a low-level API that should behave correctly regardless of the contents of the file.) Any errors emitted on the readable stream will be handled and re-emitted on the client-visible stream (returned from zipfile.openReadStream()) or provided as the err argument to the appropriate callback (for example, for fromRandomAccessReader()).

If you call readStream.destroy() on streams you get from openReadStream(), the returned stream must implement a method ._destroy() according to https://nodejs.org/api/stream.html#writable_destroyerr-callback . If you never call readStream.destroy(), then streams returned from this method do not need to implement a method ._destroy(). ._destroy() should abort any streaming that is in progress and clean up any associated resources. ._destroy() will only be called after the stream has been unpipe()d from its destination.

Note that the stream returned from this method might not be the same object that is provided by openReadStream(). The stream returned from this method might be pipe()d through one or more filter streams (for example, a zlib inflate stream).

randomAccessReader.read(buffer, offset, length, position, callback)

Subclasses may implement this method. The default implementation uses createReadStream() to fill the buffer.

This method should behave like fs.read().

randomAccessReader.close(callback)

Subclasses may implement this method. The default implementation is effectively setImmediate(callback);.

callback takes parameters (err).

This method is called once all streams returned from _readStreamForRange() have ended, and no more _readStreamForRange() or read() requests will be issued to this object.
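As a small illustration of the subclassing contract, the sketch below builds a reader backed by an in-memory Buffer. It only implements the required _readStreamForRange() and relies on the default read() and close(). The function name makeBufferReader is hypothetical, and the base class is passed in as a parameter (normally yauzl.RandomAccessReader). This duplicates what fromBuffer() already offers, so treat it purely as a template for other back ends.

```javascript
var util = require("util");
var PassThrough = require("stream").PassThrough;

// Sketch: create a reader instance that serves byte ranges out of a Buffer.
// Pass yauzl.RandomAccessReader as the first argument.
function makeBufferReader(RandomAccessReader, buffer) {
  function BufferReader() {
    RandomAccessReader.call(this);
  }
  util.inherits(BufferReader, RandomAccessReader);

  BufferReader.prototype._readStreamForRange = function(start, end) {
    var stream = new PassThrough();
    // end is exclusive, matching Buffer.subarray().
    stream.end(buffer.subarray(start, end));
    return stream;
  };

  return new BufferReader();
}
```

The resulting object could then be handed to fromRandomAccessReader(reader, buffer.length, options, callback).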

How to Avoid Crashing

When a malformed zipfile is encountered, the default behavior is to crash (throw an exception). If you want to handle errors more gracefully than this, be sure to do the following:

  • Provide callback parameters where they are allowed, and check the err parameter.
  • Attach a listener for the error event on any ZipFile object you get from open(), fromFd(), fromBuffer(), or fromRandomAccessReader().
  • Attach a listener for the error event on any stream you get from openReadStream().

Minor version updates to yauzl will not add any additional requirements to this list.
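Putting the checklist above together, a client might wrap entry iteration in a helper that funnels every error into one callback. The following is a sketch under the assumption that the ZipFile was opened with {lazyEntries: true}; the helper name forEachFileStream is hypothetical.

```javascript
// Sketch: walk a ZipFile opened with {lazyEntries: true}, routing every error
// to a single `done` callback instead of letting it throw.
function forEachFileStream(zipfile, onStream, done) {
  var finished = false;
  function finish(err) {
    if (finished) return; // report only the first outcome
    finished = true;
    done(err || null);
  }
  zipfile.on("error", finish);
  zipfile.on("end", function() { finish(); });
  zipfile.on("entry", function(entry) {
    if (/\/$/.test(entry.fileName)) return zipfile.readEntry(); // directory
    zipfile.openReadStream(entry, function(err, readStream) {
      if (err) return finish(err);
      readStream.on("error", finish);
      readStream.on("end", function() { zipfile.readEntry(); });
      onStream(entry, readStream);
    });
  });
  zipfile.readEntry();
}
```

The err passed to the open() callback still needs its own check before calling this helper. Note that onStream is expected to consume the stream, for example by piping it somewhere, otherwise the "end" event never fires and iteration stalls.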

Limitations

The automated tests for this project run on node versions 12 and up. Older versions of node are not supported.

Files corrupted by the Mac Archive Utility are not my problem

For a lengthy discussion, see issue #69. In summary, the Mac Archive Utility is buggy when creating large zip files, and this library does not make any effort to work around the bugs. This library will attempt to interpret the zip file data at face value, which may result in errors, or even silently incomplete data. If this bothers you, that's good! Please complain to Apple. :) I have accepted that this library will simply not support that nonsense.

No Streaming Unzip API

Due to the design of the .zip file format, it's impossible to interpret a .zip file from start to finish (such as from a readable stream) without sacrificing correctness. The Central Directory, which is the authority on the contents of the .zip file, is at the end of a .zip file, not the beginning. A streaming API would need to either buffer the entire .zip file to get to the Central Directory before interpreting anything (defeating the purpose of a streaming interface), or rely on the Local File Headers which are interspersed through the .zip file. However, the Local File Headers are explicitly denounced in the spec as being unreliable copies of the Central Directory, so trusting them would be a violation of the spec.

Any library that offers a streaming unzip API must make one of the above two compromises, which makes the library either dishonest or nonconformant (usually the latter). This library insists on correctness and adherence to the spec, and so does not offer a streaming API.

Here is a way to create a spec-conformant .zip file using the zip command line program (Info-ZIP) available in most unix-like environments, that is (nearly) impossible to parse correctly with a streaming parser:

$ echo -ne '\x50\x4b\x07\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00' > file.txt
$ zip -q0 - file.txt | cat > out.zip

This .zip file contains a single file entry that uses General Purpose Bit 3, which means the Local File Header doesn't know the size of the file. Any streaming parser that encounters this situation will either immediately fail, or attempt to search for the Data Descriptor after the file's contents. The file's contents are a sequence of 16 bytes crafted to exactly mimic a valid Data Descriptor for an empty file, which will fool any parser that gets this far into thinking that the file is empty rather than containing 16 bytes. What follows the file's real contents is the file's real Data Descriptor, which will likely cause some kind of signature mismatch error for a streaming parser (if one hasn't occurred already).

By using General Purpose Bit 3 (and compression method 0), it's possible to create arbitrarily ambiguous .zip files that distract parsers with file contents that contain apparently valid .zip file metadata.
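To see why the 16 bytes in the example are so misleading, the sketch below reads them the way a streaming parser hunting for a Data Descriptor might: a 4-byte signature followed by three 4-byte little-endian fields. The variable names are only for illustration.

```javascript
// The 16 bytes written to file.txt above, read as if they were a Data Descriptor.
var contents = Buffer.from([
  0x50, 0x4b, 0x07, 0x08, // signature 0x08074b50 (stored little-endian)
  0x00, 0x00, 0x00, 0x00, // crc-32 = 0, which is the CRC-32 of empty data
  0x00, 0x00, 0x00, 0x00, // compressed size = 0
  0x00, 0x00, 0x00, 0x00, // uncompressed size = 0
]);

var fakeDescriptor = {
  signature: contents.readUInt32LE(0),
  crc32: contents.readUInt32LE(4),
  compressedSize: contents.readUInt32LE(8),
  uncompressedSize: contents.readUInt32LE(12),
};
```

Every field is self-consistent for an empty file, so nothing in these bytes alone reveals that they are file contents rather than metadata.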

Limited ZIP64 Support

For ZIP64, only zip files smaller than 8PiB are supported, not the full 16EiB range that a 64-bit integer should be able to index. This is due to the JavaScript Number type being an IEEE 754 double precision float.

The Node.js fs module probably has this same limitation.
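The 8PiB figure comes from the 53 bits of integer precision in a double: 8PiB is 2^53 bytes. A quick check of the arithmetic, as a sketch with illustrative variable names:

```javascript
var EIGHT_PIB = 8 * Math.pow(2, 50); // 8 PiB in bytes, i.e. 2^53

// Below 2^53, every integer byte offset is exactly representable...
var lastSafe = Number.MAX_SAFE_INTEGER; // 2^53 - 1

// ...but at 2^53, adjacent integers start to collapse into the same double.
var collapses = (EIGHT_PIB + 1 === EIGHT_PIB);
```

Here collapses is true, which is why offsets at or beyond this size cannot be tracked reliably with a plain Number.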

ZIP64 Extensible Data Sector Is Ignored

The spec does not allow zip file creators to put arbitrary data here, but rather reserves its use for PKWARE and mentions something about Z390. This doesn't seem useful to expose in this library, so it is ignored.

No Multi-Disk Archive Support

This library does not support multi-disk zip files. The multi-disk fields in the zipfile spec were intended for a zip file to span multiple floppy disks, which probably never happens now. If the "number of this disk" field in the End of Central Directory Record is not 0, the open(), fromFd(), fromBuffer(), or fromRandomAccessReader() callback will receive an err. By extension the following zip file fields are ignored by this library and not provided to clients:

  • Disk where central directory starts
  • Number of central directory records on this disk
  • Disk number where file starts
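As an illustration of the check described above, here is a minimal sketch that reads the disk-related fields from an End of Central Directory Record. The helper names readDiskFields and assertSingleDisk are hypothetical and this is not yauzl's implementation. The field offsets are taken from the record layout in the zip file specification.

```javascript
// Hypothetical helper: read the multi-disk fields from an
// End of Central Directory Record (at least 22 bytes long).
function readDiskFields(eocdr) {
  if (eocdr.length < 22 || eocdr.readUInt32LE(0) !== 0x06054b50) {
    throw new Error("invalid end of central directory record");
  }
  return {
    diskNumber: eocdr.readUInt16LE(4),                // number of this disk
    centralDirectoryStartDisk: eocdr.readUInt16LE(6), // disk where central directory starts
    entryCountOnThisDisk: eocdr.readUInt16LE(8),      // central directory records on this disk
    entryCount: eocdr.readUInt16LE(10),               // total central directory records
  };
}

// Hypothetical helper: reject anything that claims to span multiple disks.
function assertSingleDisk(eocdr) {
  var fields = readDiskFields(eocdr);
  if (fields.diskNumber !== 0) {
    throw new Error("multi-disk zip files are not supported: found disk number: " + fields.diskNumber);
  }
  return fields;
}
```

Only the "number of this disk" field is checked here, matching the behavior described above; the other disk fields are read purely to show where they live.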

Limited Encryption Handling

You can detect when a file entry is encrypted with "Traditional Encryption" via isEncrypted(), but yauzl will not help you decrypt it. See openReadStream().

If a zip file contains file entries encrypted with "Strong Encryption", yauzl emits an error.

If the central directory is encrypted or compressed, yauzl emits an error.

Local File Headers Are Ignored

Many unzip libraries mistakenly read the Local File Header data in zip files. This data is officially defined to be redundant with the Central Directory information, and is not to be trusted. Aside from checking the signature, yauzl ignores the content of the Local File Header.

No CRC-32 Checking

This library provides the crc32 field of Entry objects read from the Central Directory. However, this field is not used for anything in this library.
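Clients that want integrity checking can compute the checksum themselves and compare it with the crc32 field. The following is a minimal sketch using the standard table-driven CRC-32 algorithm. The names crc32Update and checkCrc32 are hypothetical and not part of yauzl.

```javascript
// Build the standard CRC-32 lookup table (reflected polynomial 0xedb88320).
var CRC_TABLE = (function() {
  var table = new Uint32Array(256);
  for (var n = 0; n < 256; n++) {
    var c = n;
    for (var k = 0; k < 8; k++) {
      c = (c & 1) ? (0xedb88320 ^ (c >>> 1)) : (c >>> 1);
    }
    table[n] = c >>> 0;
  }
  return table;
})();

// Hypothetical helper: fold `chunk` into a running CRC-32.
// Pass 0 as `previous` for the first chunk.
function crc32Update(previous, chunk) {
  var c = (previous ^ 0xffffffff) >>> 0;
  for (var i = 0; i < chunk.length; i++) {
    c = CRC_TABLE[(c ^ chunk[i]) & 0xff] ^ (c >>> 8);
  }
  return (c ^ 0xffffffff) >>> 0;
}

// Hypothetical helper: accumulate the checksum over a stream's "data" events
// and compare it to the expected value (for example entry.crc32) at the end.
function checkCrc32(readStream, expectedCrc32, callback) {
  var crc = 0;
  readStream.on("data", function(chunk) { crc = crc32Update(crc, chunk); });
  readStream.on("error", callback);
  readStream.on("end", function() {
    if (crc !== expectedCrc32) return callback(new Error("crc32 mismatch"));
    callback(null);
  });
}
```

Because the checksum is updated chunk by chunk, this approach never needs the whole file in memory, which fits the design principle of keeping memory usage under control.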

versionNeededToExtract Is Ignored

The field versionNeededToExtract is ignored, because this library doesn't support the complete zip file spec at any version.

No Support For Obscure Compression Methods

Regarding the compressionMethod field of Entry objects, only method 0 (stored with no compression) and method 8 (deflated) are supported. Any of the other 15 official methods will cause the openReadStream() callback to receive an err.
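A client that prefers to skip unsupported entries rather than handle the error could check the field up front. This is a small sketch only; describeCompressionMethod is a hypothetical helper, not a yauzl API.

```javascript
// The two compression methods described above as supported.
var SUPPORTED_COMPRESSION_METHODS = {
  0: "stored",
  8: "deflated",
};

// Hypothetical helper: returns a name for a supported method,
// or null if openReadStream() would be expected to report an error.
function describeCompressionMethod(entry) {
  var method = entry.compressionMethod;
  if (Object.prototype.hasOwnProperty.call(SUPPORTED_COMPRESSION_METHODS, method)) {
    return SUPPORTED_COMPRESSION_METHODS[method];
  }
  return null;
}
```

In an "entry" handler, a null result could be logged and followed by the next readEntry() call instead of opening a read stream.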

Data Descriptors Are Ignored

There may or may not be Data Descriptor sections in a zip file. This library provides no support for finding or interpreting them.

Archive Extra Data Record Is Ignored

There may or may not be an Archive Extra Data Record section in a zip file. This library provides no support for finding or interpreting it.

No Language Encoding Flag Support

Zip files officially support charset encodings other than CP437 and UTF-8, but the zip file spec does not specify how it works. This library makes no attempt to interpret the Language Encoding Flag.

How Ambiguities Are Handled

The zip file specification has several ambiguities inherent in its design. Yikes!

  • The .ZIP file comment must not contain the end of central dir signature bytes 50 4b 05 06. This corresponds to the text "PK♣♠" in CP437. While this is allowed by the specification, yauzl will hopefully reject this situation with an "Invalid comment length" error. However, in some situations unpredictable incorrect behavior will ensue, which will probably manifest in either an invalid signature error or some kind of bounds check error, such as "Unexpected EOF".
  • In non-ZIP64 files, the last central directory header must not have the bytes 50 4b 06 07 ("PK♠•" in CP437) exactly 20 bytes from its end, which might be in the file name, the extra field, or the file comment. The presence of these bytes indicates that this is a ZIP64 file.
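A program that creates zip files could avoid the first ambiguity by checking the comment before writing it. Here is a minimal sketch; commentIsAmbiguous is a hypothetical helper, and encoding string comments as UTF-8 is an assumption made for illustration.

```javascript
// The end of central directory signature bytes.
var EOCDR_SIGNATURE = Buffer.from([0x50, 0x4b, 0x05, 0x06]);

// Hypothetical helper: returns true if a .ZIP file comment contains the
// signature bytes that a reader scanning backward would mistake for the
// start of the End of Central Directory Record.
function commentIsAmbiguous(comment) {
  var bytes = Buffer.isBuffer(comment) ? comment : Buffer.from(comment, "utf8");
  return bytes.includes(EOCDR_SIGNATURE);
}
```

Rejecting such comments at creation time is cheaper than relying on every reader to detect the conflict.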

Change History

  • 3.1.3

    • Fixed a crash when using fromBuffer() to read corrupt zip files that specify out of bounds file offsets. issue #156
    • Enhanced the test suite to run the error tests through fromBuffer() and fromRandomAccessReader() in addition to open(), which would have caught the above.
  • 3.1.2

    • Fixed handling non-64 bit entries (similar to the version 3.1.1 fix) that actually have exactly 0xffffffff values in the fields. This fixes erroneous "expected zip64 extended information extra field" errors. issue #109
  • 3.1.1

    • Fixed handling non-64 bit files that actually have exactly 0xffff or 0xffffffff values in End of Central Directory Record. This fixes erroneous "invalid zip64 end of central directory locator signature" errors. issue #108
    • Fixed handling of 64-bit zip files that put 0xffff or 0xffffffff in every field overridden in the Zip64 end of central directory record even if the value would have fit without overflow. In particular, this fixes an incorrect "multi-disk zip files are not supported" error. pull #118
  • 3.1.0

    • Added readLocalFileHeader() and Class: LocalFileHeader.
    • Added openReadStreamLowLevel().
    • Added getFileNameLowLevel() and parseExtraFields(). Added fields to Class: Entry: fileNameRaw, extraFieldRaw, fileCommentRaw.
    • Added examples/compareCentralAndLocalHeaders.js that demonstrate many of these low level APIs.
    • Noted dropped support of node versions before 12 in the "engines" field of package.json.
    • Fixed a crash when calling openReadStream() with an explicitly null options parameter (as opposed to omitted).
  • 3.0.0

    • BREAKING CHANGE: implementations of RandomAccessReader that implement a destroy method must instead implement _destroy in accordance with the node standard https://nodejs.org/api/stream.html#writable_destroyerr-callback (note the error and callback parameters). If you continue to override destroy instead, some error handling may be subtly broken. Additionally, this is required for async iterators to work correctly in some versions of node. issue #110
    • BREAKING CHANGE: Drop support for node versions older than 12.
    • Maintenance: Fix buffer deprecation warning by bundling fd-slicer with a 1-line change, rather than depending on it. issue #114
    • Maintenance: Upgrade bl dependency; add package-lock.json; drop deprecated istanbul dependency. This resolves all security warnings for this project. pull #125
    • Maintenance: Replace broken Travis CI with GitHub Actions. pull #148
    • Maintenance: Fixed a long-standing issue in the test suite where a premature exit would incorrectly signal success.
    • Officially gave up on supporting Mac Archive Utility corruption in order to rescue my motivation for this project. issue #69
  • 2.10.0

    • Added support for non-conformant zipfiles created by Microsoft, and added option strictFileNames to disable the workaround. issue #66, issue #88
  • 2.9.2

    • Removed tools/hexdump-zip.js and tools/hex2bin.js. Those tools are now located here: thejoshwolfe/hexdump-zip and thejoshwolfe/hex2bin
    • Worked around performance problem with zlib when using fromBuffer() and readStream.destroy() for large compressed files. issue #87
  • 2.9.1

    • Removed console.log() accidentally introduced in 2.9.0. issue #64
  • 2.9.0

    • Throw an exception if readEntry() is called without lazyEntries:true. Previously this caused undefined behavior. issue #63
  • 2.8.0

    • Added option validateEntrySizes. issue #53
    • Added examples/promises.js
    • Added ability to read raw file data via decompress and decrypt options. issue #11, issue #38, pull #39
    • Added start and end options to openReadStream(). issue #38
  • 2.7.0

    • Added option decodeStrings. issue #42
    • Fixed documentation for entry.fileComment and added compatibility alias. issue #47
  • 2.6.0

    • Support Info-ZIP Unicode Path Extra Field, used by WinRAR for Chinese file names. issue #33
  • 2.5.0

    • Ignore malformed Extra Field that is common in Android .apk files. issue #31
  • 2.4.3

    • Fix crash when parsing malformed Extra Field buffers. issue #31
  • 2.4.2

    • Remove .npmignore and .travis.yml from npm package.
  • 2.4.1

    • Fix error handling.
  • 2.4.0

  • 2.3.1

    • Documentation updates.
  • 2.3.0

    • Check that uncompressedSize is correct, or else emit an error. issue #13
  • 2.2.1

    • Update dependencies.
  • 2.2.0

    • Update dependencies.
  • 2.1.0

    • Remove dependency on iconv.
  • 2.0.3

    • Fix crash when trying to read a 0-byte file.
  • 2.0.2

    • Fix event behavior after errors.
  • 2.0.1

    • Fix bug with using iconv.
  • 2.0.0

    • Initial release.

Development

One of the trickiest things in development is crafting test cases located in test/{success,failure}/. These are zip files that have been specifically generated or designed to test certain conditions in this library. I recommend using hexdump-zip to examine the structure of a zipfile.

For making new error cases, I typically start by copying test/success/linux-info-zip.zip, and then editing a few bytes with a hex editor.

yauzl's People

Contributors

amilajack, andrewrk, mjomble, neverendingqs, overlookmotel, silverwind, styfle, tgabi333, thejoshwolfe, wcmonty

yauzl's Issues

non-fs API

hey I have a weird request. I wrote this https://github.com/maxogden/punzip for a somewhat common but annoying use case: given a large zip on a server, only extract a single file from it, as a stream, without downloading the whole thing.

here's more detail on the use case: https://gist.github.com/maxogden/11a85ae12074fed0b9f6

the cool thing is that it totally works! I can mount a 500mb zip, point yauzl at it, and my code translates yauzl's calls into HTTP range calls like this:

  mount-url requested +542ms 514105344-514170879 received 65536 bytes
  mount-url requested +173ms 514170880-514172204 received 1325 bytes

those were yauzl getting the entry table at the end of the file (I think).

unfortunately I had to use fuse to make it compatible with yauzl. It would be nice, though, if I could give yauzl a function such as getBytes(offset, length) and have it use that as the data source rather than a file descriptor or a path to a file.

i'm open to any suggestions or ideas you might have for this use case!

validateFileName problem?

hi ,respect for @thejoshwolfe

Just now I ran into a problem: when I call zipfile.readEntry(), I get an "invalid characters in fileName" error. I traced the code and found the check that raises it in index.js.
However, the fileName values are in a common format with nothing wrong in them, like "path\to\file.extension".
So I added a single line of code, fileName = fileName.replace(/\\/g, '/'), at the start of the function, and it works.
env:
OS: windows 10, x64
software: nodejs in nw.js
My question is: why does "invalid characters in fileName" come up?

You have my respect, and thank you for looking into my question.

destroying a readStream for a compressed entry may cause a memory leak

I'm not actually sure this is a problem, but I think it is. The zlib module appears to offer no way of aborting an inflate stream in progress. Currently, yauzl just unpipes and abandons it hoping that the GC will do something helpful. But since there are C resources supporting the JavaScript API, this is probably causing a memory leak.

file date

How can I keep a file's original date, and all its other original attributes, when unzipping? When I unzip the files and then zip them again with yazl, the MD5 checksums of the two zip files are not equal.

Comments are CP437?

Hey @thejoshwolfe,

First off, you are awesome. Love your approach with this library.

Going through the code, I was curious why you are decoding comment text using CP437 encoding, I couldn't find a reference to this encoding in the PKWare spec. Should UTF8 just work fine?

encrypted zip files should not have undefined behavior

From the docs:

Currently, the presence of encryption is not even checked, and encrypted zip files will cause undefined behavior.

encrypted zip files should, at the very least, have well-defined behavior. for example, returning an error and not crashing.

Same with ZIP64 files. The docs say the behavior is undefined, but the behavior should be defined to give an error and not crash.

No entry for containing directory?

If there is a single directory inside the archive that's wrapping everything else, yauzl doesn't seem to emit an entry for that directory.

Example zip structure:

test/
    file.txt
    subdir/
        file2.txt

Test program:

var yauzl = require("yauzl");

yauzl.open(__dirname + "/test.zip", {lazyEntries: true}, function (err, zipfile) {
	if (err) throw err;
	zipfile.readEntry();
	zipfile.on("entry", function(entry) {
		console.log(entry.fileName);
		zipfile.readEntry();
	});
});

Output:

test/file.txt
test/subdir/
test/subdir2/
test/subdir2/file2.txt

I was expecting the output to start with test/.

The zip file is created using default Windows right-click thing. You can find the demo app and zip file I used here: https://github.com/panta82/yauzl-test

add lazyEntries option

One of the main goals of yauzl is to keep memory usage under control. The current pattern of entry and end events is hostile to this goal.

Using examples/unzip.js on http://wolfesoftware.com/crowded.zip.xz (unxz it first) results in over 2GB of virtual ram allocation as the program slows to a crawl while creating empty directories. This is failure. The problem is that we call openReadStream (0bf7f48#diff-5b07aa03e052f091324dde1dfbfd25bfR53) on every entry before pumping any of the file data. This is the naive solution to the problem that #18 ran into in the test system where autoClose can lead to race conditions.

Really, the problem is that autoClose expects you to call all your openReadStream()s before yauzl emits the end event, and yauzl races to emit the end event as soon as possible. Even with autoClose off, you need to keep all your Entry objects in ram as yauzl emits them, or you'll lose the information.

We need some way to throttle entry reading.

Also, it's questionable that yauzl's API encourages clients to extract all files in parallel. This would probably be worse performance on any file system that tries to pre-fetch data ahead of your read() calls (pretty much all of them, I think) than the normal strategy of extracting one file at a time from start to finish. I didn't test that, but it seems like in general file system access wouldn't necessarily benefit from parallelization; in fact, massive parallelization probably makes things much worse due to cache thrashing and the extra resources required to keep all the pipelines open.

It will be nice to remove examples/unzip.js's dependency on a third-party synchronization library pend, which is currently being used to iterate over the entries one at a time despite yauzl's API giving us all the entries in parallel. It's a smell that our own example doesn't use the recommended execution order.

Test error events emitted from yauzl internal objects

Hi,

I wrote a module for a project relying on yauzl in order to extract a zip file.
Extraction works.
However I am trying to reach 100% code coverage.
I am having a hard time figuring out how to test error events emitted by the ZipFile object, as well as stream errors that could occur within openReadStream.

My module exposes a function through its public API:

function extractZip(targetPath, zipfilePath, fd, done) {
  const yauzlOptions = {"autoClose": true, "lazyEntries": true};

  yauzl.fromFd(fd, yauzlOptions, function processZipFile(err, zipfile) {
    if (err) {
      // This part was tested using a mocking object with proxyquire
      return done(err);
    }
    zipfile.readEntry();

    zipfile.on('error', function zipFileErrorEventHandler(err) {
      // That's the part of the code I would like to cover
    });
  });
}

I was thinking maybe I should stub or mock the ZipFile object and emit the error event, but I can't figure out what would be the best or simplest way to achieve that.

Any hints?

Thanks.

throw errors when clients call the api wrong

The docs currently talk about a lot of "undefined behavior" (see especially readEntry). It would probably be more convenient for clients learning how to use this library if yauzl would throw an explicit exception instead of charging headlong into undefined behavior.

When reading zip file from buffer, "end" is fired before last "entry" event

var getRawBody = require('raw-body')
var yauzl = require("yauzl");
var q = require('q');

var streamToBuffer = function(sourceStream) {
    var deferred = q.defer();

    getRawBody(sourceStream, {}, function (err, buffer) {
        if (err) {
            deferred.reject(err);
        } else {
            deferred.resolve(buffer);
        }
    });

    return deferred.promise;
};

streamToBuffer(fileStream).then(function(buffer) {
    yauzl.fromBuffer(buffer, function(err, zipfile) {
        if (err) {
            console.log('error parsing zip file', err);
            throw err;
        }

        zipfile.on('entry', function (entry) {
            console.log('zip file entry: ' + entry.fileName);
        }).on('error', function (error) {
            console.log('error parsing zip file', error);
        }).on('end', function () {
            console.log('done with zip file');
        });
    });
});

directory safety considerations

  • The README should reassure the API user that yauzl is protecting them against absolute paths in directory names so they can avoid writing those safety checks. Probably both in the usage example and in the API docs below.
  • What about directory names that have .. in them? E.g. ../../../gotcha/sucker

btw, use /foo/.test() instead of /foo/.exec() when you want a simple boolean answer.
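The kinds of checks being discussed could look something like the sketch below. This is a hypothetical helper written for illustration, not yauzl's validateFileName(), and the exact rules and messages are assumptions.

```javascript
// Hypothetical check: returns an error message string for an unsafe entry
// name, or null if the name looks safe to join onto an output directory.
function checkEntryName(fileName) {
  if (fileName.indexOf("\\") !== -1) {
    return "invalid characters in fileName: " + fileName;
  }
  if (/^[a-zA-Z]:/.test(fileName) || /^\//.test(fileName)) {
    return "absolute path: " + fileName;
  }
  if (fileName.split("/").indexOf("..") !== -1) {
    return "invalid relative path: " + fileName;
  }
  return null;
}

console.log(checkEntryName("docs/readme.txt"));
console.log(checkEntryName("../../../gotcha/sucker"));
```

Splitting on "/" and comparing whole segments avoids rejecting harmless names that merely contain two dots, such as "notes..txt".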

add byte range options to openReadStream

Sort of thinking out loud here. I have a use case where I want to 'mount' a compressed archive and access bytes randomly without decompressing the whole archive up front. Basically I want to:

  1. Efficiently get the entry that matches some filename
  2. Read a byte range from that entry
  3. Repeat this many times, potentially reading the same entry multiple times

The yauzl API seems to be geared for single pass unzipping, which makes sense. One approach I was thinking of is to get all on('entry') entries up front and keep them in memory, then when a byte range request comes in, use the entry to retrieve the byte range. But I ran into problems; it would be much nicer to lazily consult the central directory rather than having to read it all up front.

The other issue is related to Deflate which requires decompression from the beginning of the entry. I guess an alternative compression type like BGZF would make arbitrary byte range lookups much faster, but it wouldn't be compatible with many implementations. However! I found another technique where you do a single pass over the entry and build an index (https://github.com/madler/zlib/blob/master/examples/zran.c). I think this would be acceptable for my use case.

Being able to implement the zran style indexing on top of yauzl would mean some API changes I think, e.g. a way to get a single entry from the CDR by name, and a lower level way to control the decompression state to support zran. Before I got too deep I wanted to sanity check this use case, does it seem doable?

ENOENT

events.js:85
throw er; // Unhandled 'error' event
^
Error: ENOENT, open 'TestZip/001.png'

at Error (native)

When running example script on any file I get this. I don't know if it's really an error, but I'm using the example script exactly as written changing only the path to my .zip file.

Allow settings custom character encoding for files within the zip

In the code below, what is the encoding of the file referred to by readStream?

    yauzl.open('file.zip', (err, zipfile) => {
      if (err) throw err
      zipfile.on('entry', (entry) => {
        zipfile.openReadStream(entry, (err, readStream) => {
          if (err) throw err
          // What is readStream encoding? 
        })
      })
    })

The README says

Zip files officially support charset encodings other than CP437 and UTF-8, but the zip file spec does not specify how it works. This library makes no attempt to interpret the Language Encoding Flag.

It would be helpful if there can be a way to tell yauzl the encoding of the file if you know what it is. It's more efficient if yauzl does it directly than having to pipe it through a conversion stream like iconv-lite.

End of Central Directory Record signature not found for base64-encoded zipfile

Real Case (background info):

I need to receive zipped (6z) base64 data and unzip them.

Test Case:

Hello, I zipped a pages doc in a folder with the mac standard software (osx el capitan). Then I encoded this file to base64 with an online converter: http://base64converter.com/ and sent it to my app.

Code

Now I tried to unzip this file:

exports.unzip = (base64, callback)->
  buffer = new Buffer(base64, 'base64')
  yauzl.fromBuffer buffer, { lazyEntries: true }, (err, zipfile) ->
    if err?
      throw err
    zipfile.readEntry()
    zipfile.on 'entry', (entry) ->
      if /\/$/.test(entry.fileName)
        # directory file names end with '/'
        mkdirp entry.fileName, (err) ->
          if err
            throw err
          zipfile.readEntry()
          return
      else
        # file entry
        zipfile.openReadStream entry, (err, readStream) ->
          if err
            throw err
          # ensure parent directory exists
          mkdirp path.dirname(entry.fileName), (err) ->
            if err
              throw err
            console.log "SAVING, entry.fileName #{entry.fileName}"
            readStream.pipe fs.createWriteStream(entry.fileName)
            readStream.on 'end', ->
              zipfile.readEntry()
              return callback()
            return
          return
      return
    return

Base64

This is the base64:


Zip

This is the zip file:
Test Doc.zip

Directory entry size validation

I use your amazing module to decompress zip archives from different sources. The archives may be produced by any compression tool, and I don't know exactly which tool was used.

I've found a problem with some archives: the readEntry() method fails entry size validation when processing a directory:

      if (entry.compressionMethod === 0) {
        if (entry.compressedSize !== entry.uncompressedSize) {
          var msg = "compressed/uncompressed size mismatch for stored file: " + entry.compressedSize + " != " + entry.uncompressedSize;
          return emitErrorAndAutoClose(self, new Error(msg));
        }
      }

It looks like some zip tool consistently sets the following headers for a directory entry:
compressionMethod: 0
compressedSize: 0
uncompressedSize: 4096

Unfortunately, I cannot provide an example zip file for this case because it contains sensitive data.

What do you think about a boolean option to either validate sizes or skip validation? Or one to validate or skip directory sizes specifically? If you accept this proposal, I'll gladly create a pull request.

test suite failure

npm install && npm test fails for me:

test/success/linux-info-zip.zip(buffer): pass

/Users/maxogden/src/js/yauzl/test/test.js:55
              throw new Error(messagePrefix + "not supposed to exist");
                    ^
Error: test/success/unicode.zip(fd): Turmion Kätilöt/: not supposed to exist
    at /Users/maxogden/src/js/yauzl/test/test.js:55:21
    at pendGo (/Users/maxogden/src/js/yauzl/node_modules/pend/index.js:30:3)
    at Pend.go (/Users/maxogden/src/js/yauzl/node_modules/pend/index.js:13:5)
    at ZipFile.<anonymous> (/Users/maxogden/src/js/yauzl/test/test.js:51:27)
    at ZipFile.EventEmitter.emit (events.js:95:17)
    at /Users/maxogden/src/js/yauzl/index.js:237:12
    at /Users/maxogden/src/js/yauzl/index.js:329:5
    at /Users/maxogden/src/js/yauzl/node_modules/fd-slicer/index.js:28:7
    at Object.wrapper [as oncomplete] (fs.js:454:17)
npm ERR! Test failed.  See above for more details.
npm ERR! not ok code 0

this happens for a bunch of the zip files in the success/ folder. if I delete them then npm test finally passes.

could you set up travis CI on this repo? npm install -g travisjs && travisjs init

streaming unzip?

I just wanted to see if you guys thought about supporting a pure unzip use case, e.g.

cat file.zip | unzip

It seems that's the approach taken by node-unzip: https://github.com/EvanOxfeld/node-unzip/blob/5a62ecbcef6523708bb8b37decaf6e41728ac7fc/lib/parse.js#L51-L57

They basically scan through the zip from beginning to end, and when they see different signatures they emit entries, whereas the yauzl approach is to first skip to the end and read the CDR and emit entries from that.

The nice thing about a pure stream API is you can use unix pipes and not require random access in order to unzip. On the other hand, the nice thing about reading the CDR first is you can do e.g. lazy file mounting + extracting use cases over HTTP like I did in the issue yesterday.

I'm just curious if there are other reasons why streaming unzipping isn't implemented in yauzl, other than that it's currently a nice, compact 400-line implementation and supporting two APIs would be tedious. Cheers :)

Support closing a zipfile partway through reading entries

It's possible to have several million entries in a zipfile. Reading them all can be slow, and a client may want to look for just one and then abort reading the entries partway through by calling close() instead of another readEntry(). The docs currently call this undefined behavior, but it should probably just work.

entry.fileComment is documented wrong

In the README, the field is called entry.comment. In the code, it's called entry.fileComment. It's possible that users have written code that relies on either one of these, so we should make them both exist as aliases of one another.

The README will now document that entry.fileComment is the recommended way, and entry.comment is deprecated.

We should also add a test for file comments. If it's in the README, it needs first-class support, and that means tests.

race condition in test.js: Error: closed

After some amount of normal test output:

/home/josh/dev/yauzl/test/test.js:78
                if (err) throw err;
                               ^
Error: closed
    at ZipFile.openReadStream (/home/josh/dev/yauzl/index.js:262:37)
    at /home/josh/dev/yauzl/test/test.js:77:23
    at pendGo (/home/josh/dev/yauzl/node_modules/pend/index.js:54:3)
    at onCb (/home/josh/dev/yauzl/node_modules/pend/index.js:41:7)
    at AssertByteCountStream.<anonymous> (/home/josh/dev/yauzl/test/test.js:91:19)
    at AssertByteCountStream.emit (events.js:129:20)
    at _stream_readable.js:908:16
    at process._tickCallback (node.js:355:11)
npm ERR! Test failed.  See above for more details.

The problem seems to be that we are deferring openReadStream until after entityProcessing.go lets us start. Since we're using yauzl's autoClose: true, we're missing the window of time when we can call openReadStream.

don't throw exceptions for malformed zipfiles

There are numerous ways for maliciously constructed .zip files to crash whatever program is using yauzl. One example is an invalid UTF8 filename encoding.

All such error cases should provide the err in callbacks or emit the error event instead of allowing exceptions to bubble out of the library and crash the client program.

Invalid comment length

I'm having trouble unzipping a zip file uploaded by a user. The zip opens fine in any other unzip software I've tried. The error I'm getting is:

Error: invalid comment length. expected: 12298. found: 0
    at /usr/src/app/node_modules/yauzl/index.js:125:25
    at /usr/src/app/node_modules/yauzl/index.js:539:5
    at /usr/src/app/node_modules/fd-slicer/index.js:32:7
    at FSReqWrap.wrapper [as oncomplete] (fs.js:681:17)

If I comment out line 125 of index.js where the error is thrown, the file does seem to unzip properly. Any thoughts?

CRC-32 checks

Readme says CRC32 check not performed. Is that an option or do you not plan to support it?

Concat shredded file content strings by default?

zipfile.on("entry", function(entry) {
  if (/\/$/.test(entry.fileName)) return;
  zipfile.openReadStream(entry, options, function(err, readStream) {
    if (err) throw err;
    readStream.on('data', data => {
      console.log(entry.fileName)
      console.log(data.toString())
    })
  })
  zipfile.readEntry()
});

I am extracting xml files from zip and when I log them inside on('data' event they are multiparted. Basically in logs, fileName gets repeated and data is shredded in parts.

Is there builtin method for concating this data as single strings that are ready to be parsed as whole?
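One common approach is to collect the chunks and join them once the stream ends. The sketch below uses only Node's standard stream events; collectStream is a hypothetical helper, not a yauzl API, and it is only appropriate for entries known to be small enough to hold in memory.

```javascript
// Hypothetical helper: buffer an entire readable stream and hand back
// one Buffer containing all of its data.
function collectStream(readStream, callback) {
  var chunks = [];
  readStream.on("data", function(chunk) {
    chunks.push(chunk);
  });
  readStream.on("error", callback);
  readStream.on("end", function() {
    // Concatenate first, decode later: a multi-byte UTF-8 character may be
    // split across two chunks, so decoding chunk by chunk can corrupt it.
    callback(null, Buffer.concat(chunks));
  });
}
```

Inside the openReadStream() callback, this could be called as collectStream(readStream, function(err, buffer) { ... }), converting with buffer.toString("utf8") before handing the text to an XML parser.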

option to emit raw string buffers instead of decoded strings

I'm using version 2.6.0 FWIW, node 6.

± node debug bin.js foo.zip
< Debugger listening on [::]:5858
connecting to 127.0.0.1:5858 ... ok
break in bin.js:2
  1
> 2 'use strict'
  3 const extractExec = require('./')
  4 const fs = require('fs')
c
break in index.js:46
 44     // TODO: what if we get multiple plists?
 45     const plist = plists[0]
>46     debugger
 47     getExecStream(fd, plist.CFBundleExecutable, (err, entry, exec) => {
 48       debugger
c
break in index.js:19
 17     zip.on('entry', function onentry (entry) {
 18       if ((/XXXThing.*app\/XXXThing-.*/i).test(entry.fileName)) {
>19         debugger;
 20       }
 21       if (!isOurExec(entry, execname)) { return }
repl
Press Ctrl + C to leave debug repl
> entry.fileName
'Payload/XXXThing-╬▓.app/XXXThing-╬▓'
> execname
'XXXThing-β'

as you can see the execname is right but the entry.fileName is not right utf-8 AFAICT.

Some directory file names don't end with /

Steps to Reproduce:
Download: https://github.com/f111fei/test-files/raw/b124ebeb0ac1378f1335b0601485bd926743031f/yauzl-test.zip

Run: node index.js

Get Output:

Current FileName: extension/1
Current FileName: extension/1/2
c:\Users\xzper\Desktop\yauzl-test\index.js:23
if (err) throw err;
^

Error: EEXIST: file already exists, mkdir 'c:\Users\xzper\Desktop\yauzl-test\extension\1'
at Error (native)

The entry is a directory, but entry.fileName doesn't end with '/'.

Executables fail to run if started too soon after being unzipped

This is pseudocode with lots of stuff like error handling removed.

yauzl.fromFd(fd, { lazyEntries: true }, (err, zipfile) => {
                zipfile.readEntry();
                zipfile.on('entry', (entry: yauzl.Entry) => {
                        zipfile.openReadStream(entry, (err, readStream) => {
                            mkdirp.mkdirp(path.dirname(path), { mode: 0o775 }, (err) => {
                                readStream.pipe(fs.createWriteStream(path, { mode: fileMode }));
                                readStream.on('end', () => {
                                    zipfile.readEntry();
                                });
                            });
                        });
                    }
                });
                zipfile.on('end', () => {
                    resolve();
                });
            });
        })
err = process.spawn();

The bug is that the process.spawn returns -4082 (0xFFFFF00E), which means the process isn't finished being written yet. It looks like something isn't flushing the buffer soon enough. Sitting in a busy wait loop will eventually cause the process.spawn to succeed. I tried to somehow close or flush the stream buffer but nothing fixed it.

We hit this bug with the implementation of our C++ VS Code extension, but we worked around it by installing the .exe we need to launch before installing the other executables.

ability to abort a pipeline from openReadStream

The unpipe branch currently has a test failure due to trying to walk away from an 8GB read stream after reading 0x100 bytes. The pipe is left open, and yauzl never emits the close event.

In general, we should have the ability to abort a read stream partway through. According to my research, we need to provide this feature rather than relying on some existing Node solution that works for all generic streams and pipes.

Credit to @timotm for finding this limitation 8 months ago in #13.

How do I stop traversing but still have an end event fire?

How do I stop traversing the zip but still have the zip 'end' event fire?
It seems that zipfile.readEntry() is the only thing that triggers the 'end' event,
but I don't want to call it once I have what I want, and I DO want the end event to fire.

autoClose: false doesn't help. The close event doesn't fire for fromBuffer.
What am I missing?

var yauzl = require("yauzl");
var fs = require("fs");
var path = require("path");
var mkdirp = require("mkdirp"); // or similar

yauzl.fromBuffer(buffer, {lazyEntries: true}, function(err, zipfile) {
  if (err) throw err;
  zipfile.readEntry();
  zipfile.on("entry", function(entry) {
    if (/\/$/.test(entry.fileName)) {
      // do something here, then stop walking the zip file.
    } else {
      zipfile.readEntry();
    }
  });

  zipfile.once("end", function() {
    // clean up. this never fires if we found our target file.
  });
});
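One way to get a single completion callback whether the walk stops early or runs to the end is to route both paths through the same once-only function. The sketch below assumes only what the README states about lazyEntries (each readEntry() call leads to one 'entry' or 'end' event); findFirstDirectory is a hypothetical helper, not a yauzl API:

```javascript
// Hypothetical helper: walk entries until the first directory entry is found,
// then call `done` exactly once. `done` is also called if the walk reaches
// the end without a match, so cleanup code can live in one place.
function findFirstDirectory(zipfile, done) {
  let finished = false;
  function finish(entry) {
    if (finished) return;
    finished = true;
    done(null, entry);
  }
  zipfile.on("entry", function (entry) {
    if (/\/$/.test(entry.fileName)) {
      finish(entry); // stop walking: no further readEntry() call
    } else {
      zipfile.readEntry();
    }
  });
  zipfile.once("end", function () {
    finish(null); // reached the end without finding a directory
  });
  zipfile.readEntry();
}
```

This does not make the 'end' event itself fire early; it only gives the caller one place to clean up.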




nice to have: better test coverage for ZIP64 error cases

We've only got 1 test currently for ZIP64 and it's a simple passing test. We should also test corner cases such as:

  • file size is reported to be larger than Number.MAX_SAFE_INTEGER.
  • offsets to metadata and file data are out of bounds for the file size.
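The two corner cases above could be guarded by checks along these lines. This is a sketch, not yauzl's actual code: both function names are hypothetical, and Buffer.readBigUInt64LE assumes Node 12 or later.

```javascript
// Hypothetical helper: read an unsigned 64-bit little-endian integer and
// reject values that a JavaScript number cannot represent exactly.
function readSafeUInt64LE(buffer, offset) {
  if (offset < 0 || offset + 8 > buffer.length) {
    throw new Error("64-bit field at offset " + offset + " is out of bounds");
  }
  const value = buffer.readBigUInt64LE(offset);
  if (value > BigInt(Number.MAX_SAFE_INTEGER)) {
    throw new Error("value " + value + " exceeds Number.MAX_SAFE_INTEGER");
  }
  return Number(value);
}

// Hypothetical helper: check that a region described by zip metadata
// lies entirely inside a file of the given size.
function assertInBounds(start, length, fileSize) {
  if (start < 0 || length < 0 || start + length > fileSize) {
    throw new Error(
      "region [" + start + ", " + (start + length) + ") is outside a file of " + fileSize + " bytes"
    );
  }
}
```

Test fixtures for these cases would be zip files whose ZIP64 fields are hand-edited to hold such values.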

RangeError in buffer.js due to error at index.js:283

I was getting the following stack trace while unzipping a file:

buffer.js:620
throw new RangeError('index out of range');
^

RangeError: index out of range
at checkOffset (buffer.js:620:11)
at Buffer.readUInt16LE (buffer.js:666:5)
at /usr/local/lib/node_modules/yauzl/index.js:286:41
at /usr/local/lib/node_modules/yauzl/index.js:474:5
at /usr/local/lib/node_modules/yauzl/node_modules/fd-slicer/index.js:32:7
at FSReqWrap.wrapper as oncomplete

I tracked it down to line 283:
while (i < extraFieldBuffer.length) {

which should instead be at least:
while (i+4 < extraFieldBuffer.length) {

to avoid attempting to read past the end of the buffer at line 284 and 285.

However, I think you should add another check before line 289 to ensure the extraFieldBuffer.copy does not fail due to an invalid size field in the zip file. That is, if it is invalid, you should throw a descriptive exception rather than letting it be handled by another RangeCheck error (which doesn't explain the problem very well to the casual user of a damaged zip file).

(Note that there appears to be no easy way to trap this exception since it occurs within FSReqWrap wrapper).
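The checks suggested above could look roughly like the following sketch. It is not yauzl's code: parseExtraFields is a hypothetical name, and it relies only on the zip extra field layout (a 2-byte id, a 2-byte data size, then the data). It throws for brevity; inside an asynchronous callback the error would be passed to the callback instead, so the caller can handle it.

```javascript
// Hypothetical helper: split a zip "extra field" buffer into records,
// validating every length before reading.
function parseExtraFields(extraFieldBuffer) {
  const fields = [];
  let i = 0;
  while (i < extraFieldBuffer.length) {
    // A record header needs 4 bytes: 2 for the id and 2 for the data size.
    if (i + 4 > extraFieldBuffer.length) {
      throw new Error("extra field header truncated at offset " + i);
    }
    const id = extraFieldBuffer.readUInt16LE(i);
    const size = extraFieldBuffer.readUInt16LE(i + 2);
    const start = i + 4;
    const end = start + size;
    if (end > extraFieldBuffer.length) {
      throw new Error(
        "extra field 0x" + id.toString(16) + " claims " + size +
        " bytes but only " + (extraFieldBuffer.length - start) + " remain"
      );
    }
    fields.push({ id: id, data: extraFieldBuffer.subarray(start, end) });
    i = end;
  }
  return fields;
}
```

Note that this version treats trailing bytes too short to form a header as an error, which is stricter than simply stopping the loop.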

No Streaming (Question / Brain storming)

No Streaming Unzip API

Due to the design of the .zip file format, it's impossible to interpret a .zip file from start to finish (such as from a readable stream) without sacrificing correctness. The Central Directory, which is the authority on the contents of the .zip file, is at the end of a .zip file, not the beginning.

I really don't know jack about the zip format, but I am just trying to think out of the box here for possibly having a streaming api.

Since the CD is at the end, could we not read the file stream in reverse and detect the CD that way?

The main use case I am looking at is getting a single large file out of many in a zip, using streaming if possible.
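Reading "in reverse" amounts to random access from the end of the file, which needs a seekable source with a known size rather than a forward-only stream. As a rough sketch of the idea: given the last bytes of the file in a buffer, scan backwards for the End of Central Directory record (signature 0x06054b50, 22 fixed bytes plus a comment of up to 65535 bytes). The function name is hypothetical, and ZIP64 archives are ignored here.

```javascript
const EOCD_SIGNATURE = 0x06054b50;
const EOCD_MIN_SIZE = 22; // fixed-size part of the record, without the comment

// Hypothetical helper: `tail` must hold the last bytes of the zip file
// (at most 22 + 65535 bytes are ever needed).
function findEndOfCentralDirectory(tail) {
  for (let i = tail.length - EOCD_MIN_SIZE; i >= 0; i--) {
    if (tail.readUInt32LE(i) !== EOCD_SIGNATURE) continue;
    const commentLength = tail.readUInt16LE(i + 20);
    // The comment must run exactly to the end of the file; otherwise this
    // signature is a coincidence inside other data.
    if (i + EOCD_MIN_SIZE + commentLength !== tail.length) continue;
    return {
      entryCount: tail.readUInt16LE(i + 10),
      centralDirectorySize: tail.readUInt32LE(i + 12),
      centralDirectoryOffset: tail.readUInt32LE(i + 16),
    };
  }
  return null; // no record found in the supplied bytes
}
```

With the central directory offset in hand, a reader can look up one entry and read only that entry's bytes, which matches the single-large-file use case.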

Cannot unzip a .zip file uploaded from Java

When unzipping a Java-uploaded .zip file from a buffer, yauzl throws the error "invalid comment length. expected: 2. found: 0". But when we comment out the verification code below, the file unzips successfully:

if (commentLength !== expectedCommentLength) {
  return callback(new Error("invalid comment length. expected: " + expectedCommentLength + ". found: " + commentLength));
}

Regarding the Java upload: the uploaded buffer contains fewer bytes than the same upload from Node.js; see http://stackoverflow.com/questions/34643660/missing-bytes-when-uploading-zip-file-with-multipart-form-data-using-java

Can we just deprecate the verification code above?

Support a promise paradigm

Before

    yauzl.open('file.zip', (err, zipfile) => {
      if (err) throw err
      zipfile.on('entry', (entry) => {
        zipfile.openReadStream(entry, (err, readStream) => {
          if (err) throw err
          readStream.pipe(somewhere)
        })
      })
    })

After

    const zipfile = await yauzl.open('file.zip')
    zipfile.on('entry', async (entry) => {
      const readStream = await zipfile.openReadStream(entry)
      readStream.pipe(somewhere)
    })

If yauzl.open and zipfile.openReadStream return Promises if no callbacks are provided, it would simplify code for people who choose to use async/await. We could remove all if (err) throw err because unhandled errors would be thrown automatically. And we don't have as deep nesting of the code, which makes things hard to read unless functions are extracted which may otherwise not have needed to be extracted.

Of course, if callbacks are provided then no Promise is returned and code behaves as it currently does, making this change backwards-compatible.
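The dual behavior described above can be sketched with a small wrapper. This is only an illustration of the proposal, not something yauzl provides: callbackOrPromise is a hypothetical name, and it assumes the wrapped function takes a Node-style (err, result) callback as its last argument.

```javascript
// Hypothetical wrapper: if the last argument is a function, call through
// unchanged; otherwise append a callback and return a Promise.
function callbackOrPromise(fn) {
  return function (...args) {
    if (typeof args[args.length - 1] === "function") {
      return fn.apply(this, args); // callback style: behaves as before
    }
    return new Promise((resolve, reject) => {
      fn.call(this, ...args, (err, result) => {
        if (err) reject(err);
        else resolve(result);
      });
    });
  };
}
```

For the promise half alone, Node's built-in util.promisify (available since Node v8) does the same job on callback-last functions.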

Node v7, which will be released soon, will make native async/await available behind flags. And Node v8 will make them available without any flags. This feature would give the option to write simpler code for anyone using those versions of Node, or any version of Node with Babel (as I'm currently doing it).

CPU usage

Hi!

It seems that at last I found an unzip library that doesn't leak memory, so thank you.

Having said that, I'm observing that yauzl takes twice the CPU time of the unzip command-line utility to decompress an identical ZIP. The code used is the one you provide in the README.

The ZIP file is public: https://downloads.mariadb.org/f/mariadb-10.1.16/winx64-packages/mariadb-10.1.16-winx64.zip

While unzip takes 16 seconds of CPU time, yauzl takes 33.

Is there any room for improvement?

Thank you,

Albert

Zip bomb prevention?

Hi,

How is one supposed to abort processing of zip entry / file while processing entries?

Some background: I want to prevent a zip bomb from hogging CPU/memory resources, and would like to check for actual, cumulative uncompressed size while uncompressing an entry. For that, I implemented my own Writable stream which raises an error (through callback) when it gets too much data. I then catch this error and currently I call .close() for the readStream I got in yauzl's entry callback.

However, this seems to trigger a bug in node's zlib implementation (I tried both 0.10.28 and 0.12.2) and aborts the execution:

Assertion failed: (ctx->mode_ != NONE && "already finalized"), function Write, file ../src/node_zlib.cc, line 147.
Abort trap: 6

While I theoretically could patch my way around this, I naturally wouldn't want to fork both zlib.js and your library. So can I abort the processing of an entry / entire zip file by some other way cleanly, without any excessive CPU or memory usage?

Full sample code available at https://github.com/timotm/node-zip-bomb
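A size-limiting stream of the kind described above could be sketched as follows, using only Node's built-in stream module. createByteLimiter is a hypothetical name, and the sketch covers only the counting side; it does not by itself solve the question of how to cleanly stop the upstream read stream.

```javascript
const { Transform } = require("stream");

// Hypothetical helper: pass data through unchanged, but fail with an error
// once more than `maxBytes` bytes have been seen in total.
function createByteLimiter(maxBytes) {
  let total = 0;
  return new Transform({
    transform(chunk, encoding, callback) {
      total += chunk.length;
      if (total > maxBytes) {
        callback(new Error("uncompressed data exceeds limit of " + maxBytes + " bytes"));
      } else {
        callback(null, chunk);
      }
    },
  });
}
```

It would sit between the entry's read stream and the destination, for example readStream.pipe(createByteLimiter(limit)).pipe(destination), with an 'error' listener on the limiter. To enforce a cumulative limit across entries, the counter would have to be shared between limiters.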
