ulikunitz / xz
Pure golang package for reading and writing xz-compressed files
License: Other
The following program:
package main

import (
	"bytes"
	"io"
	"log"
	"os"

	"github.com/ulikunitz/xz"
)

func main() {
	const text = ""
	var buf bytes.Buffer
	// compress text
	w, err := xz.NewWriter(&buf)
	if err != nil {
		log.Fatalf("xz.NewWriter error %s", err)
	}
	if _, err := io.WriteString(w, text); err != nil {
		log.Fatalf("WriteString error %s", err)
	}
	if err := w.Close(); err != nil {
		log.Fatalf("w.Close error %s", err)
	}
	// decompress buffer and write output to stdout
	r, err := xz.NewReader(&buf)
	if err != nil {
		log.Fatalf("NewReader error %s", err)
	}
	if _, err = io.Copy(os.Stdout, r); err != nil {
		log.Fatalf("io.Copy error %s", err)
	}
}
fails when executed:
$ go run xztest.go
2017/02/15 10:55:38 io.Copy error xz: invalid header magic bytes
exit status 1
I would expect it to handle an empty string correctly.
I want to compress data into the xz format, but I don't know how to write it to a file.
Could you answer this question?
And how do I decompress from an xz file?
Thank you so much~^^
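For reference, a minimal sketch of writing to and reading from an .xz file with this package; the file name example.xz is hypothetical:

package main

import (
	"io"
	"log"
	"os"

	"github.com/ulikunitz/xz"
)

func main() {
	// compress: wrap the destination file in an xz writer
	f, err := os.Create("example.xz")
	if err != nil {
		log.Fatal(err)
	}
	w, err := xz.NewWriter(f)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := io.WriteString(w, "hello, xz\n"); err != nil {
		log.Fatal(err)
	}
	if err := w.Close(); err != nil { // Close finishes the xz stream
		log.Fatal(err)
	}
	if err := f.Close(); err != nil {
		log.Fatal(err)
	}

	// decompress: wrap the source file in an xz reader
	g, err := os.Open("example.xz")
	if err != nil {
		log.Fatal(err)
	}
	defer g.Close()
	r, err := xz.NewReader(g)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := io.Copy(os.Stdout, r); err != nil {
		log.Fatal(err)
	}
}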
Cool project!
An obvious enhancement for the future would be to parallelize some of the work, making use of all CPU cores.
Is https://github.com/ulikunitz/xz/tree/rewrite production ready? When do you anticipate this being promoted to main?
Thanks
@pmezard reported a panic in the master tree that he found using go-fuzz. Many thanks for that. I have asked for the go-fuzz code and the crasher sequence to check what caused the bug and to fix it in the dev tree.
# github.com/ulikunitz/xz/lzma
..\github.com\ulikunitz\xz\lzma\reader.go:35: constant 4294967295 overflows int
..\github.com\ulikunitz\xz\lzma\reader2.go:31: constant 4294967295 overflows int
..\github.com\ulikunitz\xz\lzma\writer.go:78: constant 4294967295 overflows int
..\github.com\ulikunitz\xz\lzma\writer2.go:55: constant 4294967295 overflows int
go 1.6.3
windows xp
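Presumably this is a 32-bit build (Go on Windows XP is typically GOARCH=386), where int is 32 bits and the untyped constant 4294967295 no longer fits. A minimal illustration of the compiler behavior, not the package's code:

package main

// On 32-bit platforms int is 32 bits, so assigning this untyped
// constant to an int fails with "constant 4294967295 overflows int".
// The same line compiles fine on 64-bit platforms.
const maxUint32 = 4294967295

var n int = maxUint32

func main() { _ = n }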
The new version of XZ Utils supports simultaneous multi-CPU compression, which greatly improves efficiency. How can this be implemented here? Thanks!
I'm trying to decompress iOS OTA files and they are failing your flags check. They appear to have a "None" checksum.
https://tukaani.org/xz/xz-file-format-1.0.4.txt
2.1.1.2. Stream Flags
The first byte of Stream Flags is always a null byte. In the
future, this byte may be used to indicate a new Stream version
or other Stream properties.
The second byte of Stream Flags is a bit field:
Bit(s) Mask Description
0-3 0x0F Type of Check (see Section 3.4):
ID Size Check name
0x00 0 bytes None
0x01 4 bytes CRC32
0x02 4 bytes (Reserved)
0x03 4 bytes (Reserved)
0x04 8 bytes CRC64
0x05 8 bytes (Reserved)
0x06 8 bytes (Reserved)
0x07 16 bytes (Reserved)
0x08 16 bytes (Reserved)
0x09 16 bytes (Reserved)
0x0A 32 bytes SHA-256
0x0B 32 bytes (Reserved)
0x0C 32 bytes (Reserved)
0x0D 64 bytes (Reserved)
0x0E 64 bytes (Reserved)
0x0F 64 bytes (Reserved)
4-7 0xF0 Reserved for future use; MUST be zero for now.
Thank you for a great pkg!
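For reference, a hedged sketch of how the check type could be read from the quoted Stream Flags definition; this illustrates the spec, not the package's actual code:

package main

import (
	"errors"
	"fmt"
)

// checkType extracts the Type of Check ID (bits 0-3) from the second
// Stream Flags byte and rejects non-zero reserved bits (bits 4-7).
func checkType(flagByte2 byte) (byte, error) {
	if flagByte2&0xF0 != 0 {
		return 0, errors.New("xz: reserved stream flag bits must be zero")
	}
	return flagByte2 & 0x0F, nil
}

func main() {
	id, err := checkType(0x00) // iOS OTA files: check type 0x00 = None
	if err != nil {
		panic(err)
	}
	fmt.Printf("check ID %#04x\n", id) // prints "check ID 0x00"
}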
Hi,
I wondered if I can use your package to extract files from an NSIS installer .exe? It looks like it uses LZMA compression with an offset of 4 EF BE AD DE N u l l s o f t I n s t, according to the command 7z i. Thanks for any help with this.
The implementation of readUvarint at https://github.com/ulikunitz/xz/blob/master/bits.go#L56 is very similar to the vulnerable code in the Go encoding/binary library and seems to suffer from the same vulnerability described in golang/go#40618.
See the fix at https://go-review.googlesource.com/c/go/+/247120/2/src/encoding/binary/varint.go
Note: I couldn't find any information on how to disclose this issue to the maintainers. I would also suggest setting up a Security Policy for the project within GitHub.
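For context, a sketch of the guard the upstream Go fix adds: iterations are capped at MaxVarintLen64 so malicious input cannot drive the shift past 64 bits. This mirrors encoding/binary after CL 247120 and is not this package's current code:

package bits

import (
	"encoding/binary"
	"errors"
	"io"
)

var errOverflow = errors.New("readUvarint: varint overflows a 64-bit integer")

// readUvarint decodes an unsigned varint, rejecting encodings longer
// than 64 bits instead of silently overflowing.
func readUvarint(r io.ByteReader) (uint64, error) {
	var x uint64
	var s uint
	for i := 0; i < binary.MaxVarintLen64; i++ {
		b, err := r.ReadByte()
		if err != nil {
			return x, err
		}
		if b < 0x80 {
			if i == binary.MaxVarintLen64-1 && b > 1 {
				return x, errOverflow // continuation spills into a 65th bit
			}
			return x | uint64(b)<<s, nil
		}
		x |= uint64(b&0x7f) << s
		s += 7
	}
	return x, errOverflow
}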
I have a deb package from my local /var/cache/apt/archives. I'm using Debian stretch and the package is accountsservice_0.6.43-1_amd64.deb.
If I execute the following commands, everything works (no warnings):
$ ar x accountsservice_0.6.43-1_amd64.deb data.tar.xz
$ xz -d data.tar.xz
$ ls
accountsservice_0.6.43-1_amd64.deb data.tar
$ tar tf data.tar
./
./etc/
./etc/dbus-1/
./etc/dbus-1/system.d/
./etc/dbus-1/system.d/org.freedesktop.Accounts.conf
...
But if I run the following code (where t/data.tar.xz is the file extracted from the deb cited above), I get an error:
package main

import (
	"archive/tar"
	"io"
	"log"
	"os"

	"github.com/ulikunitz/xz"
)

func main() {
	f, err := os.Open("t/data.tar.xz")
	if err != nil {
		panic(err)
	}
	r, err := xz.NewReader(f)
	if err != nil {
		panic(err)
	}
	tr := tar.NewReader(r)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err) // the 3rd panic: fails with "unexpected padding size"
		}
		log.Println(hdr)
	}
}
The execution output is (I'm using Gogland IDE, but the result is the same if I run from terminal):
GOROOT=/usr/lib/go-1.8
GOPATH=/home/langbeck/git/golang/unpackers/external:/home/langbeck/git/golang/unpackers:/home/langbeck/go
/usr/lib/go-1.8/bin/go build -i -o /tmp/maingo main
/tmp/maingo
panic: xz: unexpected padding size 5
goroutine 1 [running]:
main.main()
/home/langbeck/git/golang/unpackers/src/main/main.go:127 +0x16c
Note 1: line 127 is the line of the 3rd panic().
Note 2: xz --test data.tar.xz && echo OK runs fine, and the same deb file is installed on my system, so it's a valid deb file.
I'm attaching the deb file in question: accountsservice_0.6.43-1_amd64.deb.zip
The GNU xz command has a boolean --x86 flag which allegedly gives 0-15% extra compression when applied to files containing x86 machine code. The Linux kernel uses this flag when compressing its bzImage. It is also popular among UEFI implementations when compressing firmware.
The explanation from the xz man page is:
A BCJ filter converts relative addresses in the machine code to
their absolute counterparts. This doesn't change the size of
the data, but it increases redundancy, which can help LZMA2 to
produce 0-15 % smaller .xz file. The BCJ filters are always
reversible, so using a BCJ filter for wrong type of data doesn't
cause any data loss, although it may make the compression ratio
slightly worse.
We have already adapted this feature to Go. Would it be suitable to upstream it with an --x86 flag?
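To illustrate the idea in the man page quote, here is a toy sketch of the encode direction for CALL instructions only; the real x86 BCJ filter also handles 0xE9 jumps, validity masks, and a stream position offset, so treat this purely as an illustration:

package main

import (
	"encoding/binary"
	"fmt"
)

// bcjX86Encode rewrites the rel32 displacement of each 0xE8 (CALL)
// opcode into an absolute address. Same size, more redundancy, and
// the transform is reversed by subtracting instead of adding.
func bcjX86Encode(data []byte) {
	for i := 0; i+5 <= len(data); i++ {
		if data[i] == 0xE8 {
			rel := binary.LittleEndian.Uint32(data[i+1:])
			abs := rel + uint32(i) + 5 // target = rel + address of next instruction
			binary.LittleEndian.PutUint32(data[i+1:], abs)
			i += 4 // skip the displacement we just rewrote
		}
	}
}

func main() {
	code := []byte{0xE8, 0x10, 0x00, 0x00, 0x00, 0x90} // call +0x10; nop
	bcjX86Encode(code)
	fmt.Printf("% x\n", code) // displacement now holds the absolute target 0x15
}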
It would be helpful (for my use case) to export a function that performs the same behavior as the gxz command-line utility so that it can be called programmatically. This would de-duplicate the file read/write logic that has already been written for the command-line utility, allowing for more programmatic usage of this library.
Would this be a desirable addition?
This was attempted, incorrectly, in #48.
The 7-Zip SDK has something it calls "fast bytes", which seems to be a mechanism to limit how much time is spent looking for the best sequence to add to the dictionary. I don't see an equivalent here; how do you get around that?
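For context, "fast bytes" (often called nice length elsewhere) is usually an early-exit threshold in the match finder. A hypothetical sketch of the idea, not this package's code:

package matcher // hypothetical

// match is a hypothetical (distance, length) candidate from a match finder.
type match struct {
	distance int
	length   int
}

// bestMatch scans candidates but stops as soon as a match of at least
// niceLen bytes is found, trading a little compression ratio for speed.
func bestMatch(candidates []match, niceLen int) (best match) {
	for _, m := range candidates {
		if m.length > best.length {
			best = m
		}
		if best.length >= niceLen {
			break // "fast bytes" cutoff: good enough, stop searching
		}
	}
	return best
}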
Found in a fuzz test.
How to reproduce:
package xz_test

import (
	"bytes"
	"io/ioutil"
	"testing"

	"github.com/ulikunitz/xz"
)

func TestPanic(t *testing.T) {
	data := []byte([]uint8{253, 55, 122, 88, 90, 0, 0, 0, 255, 18, 217, 65, 0, 189, 191, 239, 189, 191, 239, 48})
	t.Log(string(data))
	r, err := xz.NewReader(bytes.NewReader(data))
	if err != nil {
		t.Skip("OK")
	}
	b, err := ioutil.ReadAll(r)
	if err != nil {
		t.Skip("OK")
	}
	t.Log(b)
}
$ go test -run "TestPanic" -v
=== RUN TestPanic
panic_test.go:13: 7zXZAソ0
--- FAIL: TestPanic (0.00s)
panic: runtime error: makeslice: len out of range [recovered]
panic: runtime error: makeslice: len out of range [recovered]
panic: runtime error: makeslice: len out of range
goroutine 6 [running]:
testing.tRunner.func1.1(0x54ef00, 0x5b1c00)
/home/linuxbrew/.linuxbrew/Cellar/go/1.15.7/libexec/src/testing/testing.go:1072 +0x30d
testing.tRunner.func1(0xc000001380)
/home/linuxbrew/.linuxbrew/Cellar/go/1.15.7/libexec/src/testing/testing.go:1075 +0x41a
panic(0x54ef00, 0x5b1c00)
/home/linuxbrew/.linuxbrew/Cellar/go/1.15.7/libexec/src/runtime/panic.go:969 +0x1b9
io/ioutil.readAll.func1(0xc000095f28)
/home/linuxbrew/.linuxbrew/Cellar/go/1.15.7/libexec/src/io/ioutil/ioutil.go:30 +0x106
panic(0x54ef00, 0x5b1c00)
/home/linuxbrew/.linuxbrew/Cellar/go/1.15.7/libexec/src/runtime/panic.go:969 +0x1b9
github.com/ulikunitz/xz.readIndexBody(0x5b40a0, 0xc00008c3c0, 0x100, 0xc000095bc0, 0x40df58, 0x20, 0x557560, 0x1)
/home/heijo/ghq/github.com/ulikunitz/xz/format.go:684 +0x1d4
github.com/ulikunitz/xz.(*streamReader).readTail(0xc00008a1e0, 0xc000074490, 0xc000074490)
/home/heijo/ghq/github.com/ulikunitz/xz/reader.go:163 +0x50
github.com/ulikunitz/xz.(*streamReader).Read(0xc00008a1e0, 0xc000244000, 0x200, 0x200, 0xc000095dd0, 0x40b125, 0xc000095dd8)
/home/heijo/ghq/github.com/ulikunitz/xz/reader.go:209 +0x4f9
github.com/ulikunitz/xz.(*Reader).Read(0xc00008c3f0, 0xc000244000, 0x200, 0x200, 0xc000244000, 0x0, 0x0)
/home/heijo/ghq/github.com/ulikunitz/xz/reader.go:112 +0xe5
bytes.(*Buffer).ReadFrom(0xc00006feb0, 0x5b4120, 0xc00008c3f0, 0x0, 0xc00008c300, 0x5b40a0)
/home/linuxbrew/.linuxbrew/Cellar/go/1.15.7/libexec/src/bytes/buffer.go:204 +0xb1
io/ioutil.readAll(0x5b4120, 0xc00008c3f0, 0x200, 0x0, 0x0, 0x0, 0x0, 0x0)
/home/linuxbrew/.linuxbrew/Cellar/go/1.15.7/libexec/src/io/ioutil/ioutil.go:36 +0xe5
io/ioutil.ReadAll(...)
/home/linuxbrew/.linuxbrew/Cellar/go/1.15.7/libexec/src/io/ioutil/ioutil.go:45
github.com/ulikunitz/xz_test.TestPanic(0xc000001380)
/home/heijo/ghq/github.com/ulikunitz/xz/panic_test.go:18 +0x185
testing.tRunner(0xc000001380, 0x58fab0)
/home/linuxbrew/.linuxbrew/Cellar/go/1.15.7/libexec/src/testing/testing.go:1123 +0xef
created by testing.(*T).Run
/home/linuxbrew/.linuxbrew/Cellar/go/1.15.7/libexec/src/testing/testing.go:1168 +0x2b3
exit status 2
FAIL github.com/ulikunitz/xz 0.005s
While researching xz as a storage medium I came across this article:
https://www.nongnu.org/lzip/xz_inadequate.html
Is what they've outlined on this page a valid concern, meaning we shouldn't be using xz?
I have a processing scenario where I read LZMA objects and need to decompress them.
While using pprof I could see that the lzma reader allocates buffers for every message:
0 0% 0.0046% 433804685.70MB 96.50% github.com/ulikunitz/xz/lzma.NewReader (inline)
4122.21MB 0.00092% 0.0056% 433804685.70MB 96.50% github.com/ulikunitz/xz/lzma.ReaderConfig.NewReader
2414.61MB 0.00054% 0.0061% 432805222.15MB 96.28% github.com/ulikunitz/xz/lzma.newDecoderDict (inline)
432802807.54MB 96.28% 96.28% 432802807.54MB 96.28% github.com/ulikunitz/xz/lzma.newBuffer (inline)
Can we add an option that allows pooling that buffer, or some other way to reuse a reader?
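To make the ask concrete, a hypothetical sketch of what buffer reuse could look like with sync.Pool; the package has no such hook today, and newBuffer is internal:

package lzmapool // hypothetical illustration, not part of this repository

import "sync"

// dictPool recycles dictionary buffers between decoder instances.
// Wiring it in would need a hook inside lzma.ReaderConfig.NewReader
// (or newDecoderDict), which does not exist today.
var dictPool = sync.Pool{
	New: func() interface{} {
		return make([]byte, 0, 8<<20) // sized for an 8 MiB dictionary, for example
	},
}

func getDictBuf() []byte { return dictPool.Get().([]byte) }

func putDictBuf(b []byte) { dictPool.Put(b[:0]) } // keep capacity, reset length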
Is it possible to work with lzip files, which are also known to use the LZMA compression algorithm?
As mentioned in #1, this project was declared as "not even alpha" in 2015. What is the current maturity of this project?
As mentioned in #23, this project was considered to be not near the speed of XZ back in 2018. Has the speed improved since then? Were you referring to general code optimization, or to parts of your LZMA implementation that needed to be reimplemented according to the specification in order to meet the standard speed expectations?
As mentioned in #26, this project implements a different LZMA algorithm than XZ. Would you be able to provide names/URLs for your algorithm and for theirs?
Some xz archives fail part way during decompression. Quite a few of the Linux kernel releases fall into this category.
You can reproduce it via:
[djm@demiurge ~]$ wget -q https://www.kernel.org/pub/linux/kernel/v3.0/linux-3.13.tar.xz
[djm@demiurge ~]$ xzcat linux-3.13.tar.xz | wc -c
549816320
[djm@demiurge ~]$ gxz -dc linux-3.13.tar.xz | wc -c
201330688
[djm@demiurge ~]$ echo $?
Note that the truncation is silent - no error is written to stderr and the exit status is 0. The problem isn't in cmd/xz - I noticed it first using the library directly.
@ulikunitz - we are sporadically getting a "limit reached" error and are wondering what the possible reasons might be, or if there is a way to increase the initial setting of the limit N. Maybe something we can do with the props or WriterConfig to increase this limit? The other weird thing is that this only seems to happen on *.tar.xz files.
Thanks to Dórian C. Langbeck I realized that I confused the padding for the block header with the padding of the block in the discussion of issue #15. Currently we don't check the block padding size. Whether we should check it is an open question.
3.3. Block Padding
Block Padding MUST contain 0-3 null bytes to make the size of
the Block a multiple of four bytes. This can be needed when
the size of Compressed Data is not a multiple of four. If any
of the bytes in Block Padding are not null bytes, the decoder
MUST indicate an error.
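A minimal sketch of the check the quoted section mandates; a hypothetical helper, not current package code:

package main

import (
	"errors"
	"fmt"
)

// checkBlockPadding enforces section 3.3: at most three padding bytes,
// all of them null, so the block size is a multiple of four.
func checkBlockPadding(pad []byte) error {
	if len(pad) > 3 {
		return fmt.Errorf("xz: unexpected padding size %d", len(pad))
	}
	for _, b := range pad {
		if b != 0 {
			return errors.New("xz: non-null byte in block padding")
		}
	}
	return nil
}

func main() {
	fmt.Println(checkBlockPadding([]byte{0, 0, 0, 0, 0})) // 5 bytes: the error from issue #15
}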
Greetings everyone,
I'm a little fresh with Go, so I hope I am looking at the right place.
Going through the API, it seems like Reader.Close and Writer.Flush are missing.
These are fairly common, as you can see in the Go standard library's zlib and gzip packages.
Is it possible to have these in a future release?
Thanks in advance
This had me scratching my head today for a solution.
I am decompressing many files, but a few wouldn't work because they are large and I would get out-of-memory panics. I looked at my code and saw I was reading the full file into a byte slice, which is expensive memory-wise. I rewrote the code to use io.Readers of sorts. Even with that, I still get out-of-memory panics. Looking at the source code, the issue lies with ReaderConfig.NewReader calling newDecoderDict, which calls newBuffer, which makes a byte slice of the buffer size, which is exactly what I had fixed in my own code. Is it possible to remedy this, and if so, could we?
Thanks,
I'm trying to figure out how to compress a folder but haven't found a solution for that yet.
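For reference, a sketch of compressing a directory by combining archive/tar with this package's xz writer; the directory and output names are hypothetical, and symlinks are not handled:

package main

import (
	"archive/tar"
	"io"
	"log"
	"os"
	"path/filepath"

	"github.com/ulikunitz/xz"
)

// compressDir tars the directory and pipes the archive through an xz writer.
func compressDir(dir, out string) error {
	f, err := os.Create(out)
	if err != nil {
		return err
	}
	defer f.Close()
	xw, err := xz.NewWriter(f)
	if err != nil {
		return err
	}
	defer xw.Close()
	tw := tar.NewWriter(xw)
	defer tw.Close() // defers run LIFO: tar, then xz, then the file
	return filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() {
			return err
		}
		hdr, err := tar.FileInfoHeader(info, "")
		if err != nil {
			return err
		}
		hdr.Name = filepath.ToSlash(path) // store a portable path
		if err := tw.WriteHeader(hdr); err != nil {
			return err
		}
		src, err := os.Open(path)
		if err != nil {
			return err
		}
		defer src.Close()
		_, err = io.Copy(tw, src)
		return err
	})
}

func main() {
	if err := compressDir("myfolder", "myfolder.tar.xz"); err != nil {
		log.Fatal(err)
	}
}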
There is a memory leak. When I decompressed all the kernel modules for analysis, the VmRSS occupancy increased to 150 MB.
gxz.exe doesn't detect the terminal in Git Bash on Windows x86-64. Terminal detection works in PowerShell. This is probably a wontfix, but I want to record it.
I am having trouble decoding a file.
I reproduced the issue with gxz after encountering it with my test case.
The error is:
lzma: Reader2 doesn't get data
Line 149 in 25c16dc
I have attached the file that produces the issue.
Running gxz -d on the file fails with:
gxz: lzma: Reader2 doesn't get data
xz -d appears to decode it properly, so this would seem to point to a problem in reader2.
Will follow up with more info or a fix.
The offending file:
https://www.dropbox.com/s/gmoyva5lx5k96vs/utils.LegacyBloomFilter.bin?dl=0
The documentation of WriterConfig is somewhat sparse. I would like to emulate (in spirit; I understand the algorithm is not identical) the result of xz --lzma2=preset=9,dict=128MiB.
Could you please point me to a starting point?
Thanks!
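A hedged sketch of a starting point; DictCap is a WriterConfig field in current master, but the mapping to xz preset 9 is only approximate:

package xzconfig // hypothetical

import (
	"io"

	"github.com/ulikunitz/xz"
)

// newBigDictWriter raises the dictionary capacity to 128 MiB, the most
// influential parameter behind the high xz presets; other knobs are
// left at their defaults here.
func newBigDictWriter(dst io.Writer) (*xz.Writer, error) {
	cfg := xz.WriterConfig{DictCap: 128 << 20} // 128 MiB dictionary
	return cfg.NewWriter(dst)
}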
I'm working on a library to do random access in xz files with multiple blocks. I would love to use your library to do the heavy lifting instead of reinventing the wheel. I need to use some of the internal pieces though, including the blockReader
Line 261 in 067145b
Would you be open to a PR which refactors things so your original package keeps the same public interface but uses a new ulikunitz/xz/lib/xzinternals package which makes public some of these currently private structs?
I'm trying to build my program that uses your library and getting this error:
vendor/github.com/ulikunitz/xz/reader.go:17:2: use of internal package github.com/ulikunitz/xz/internal/xlog not allowed
The dependency:
"github.com/ulikunitz/xz"
I added it to vendor/ using govendor add.
Trying to discover how to extract tar.xz files. Is this possible?
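Yes, in combination with archive/tar. A minimal sketch; the archive name is hypothetical, and entries are only listed rather than written to disk to keep it short:

package main

import (
	"archive/tar"
	"fmt"
	"io"
	"log"
	"os"

	"github.com/ulikunitz/xz"
)

func main() {
	f, err := os.Open("archive.tar.xz") // hypothetical file name
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	r, err := xz.NewReader(f) // undo the xz layer
	if err != nil {
		log.Fatal(err)
	}
	tr := tar.NewReader(r) // then walk the tar layer
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		// For a real extraction, sanitize hdr.Name and copy tr into a file.
		fmt.Println(hdr.Name)
	}
}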
First, many thanks for this library; I'm using it in a backup application. Works great!
But I'm not sure whether the following behavior is a bug:
I've developed a backup application which runs as a daemon and compresses files on a daily schedule. While the application is idle, it consumes ~6 MB of RAM. When the daemon executes tasks other than xz compression, memory consumption grows for as long as the task runs; when such a task completes, memory usage returns to ~6 MB.
But when the file-compression task starts, the daemon consumes ~110 MB of memory, which is absolutely fine, except that this memory is not released after the task has completed. Even after multiple executions of the compression task, the process still holds that amount of memory. Is this normal behavior or am I doing something wrong?
Here is the code snippet which is responsible for the compression:
destinationFile, err := os.OpenFile(destinationFilePath, os.O_CREATE|os.O_WRONLY, 0644)
if err != nil {
	return err
}
defer destinationFile.Close()
xzWriter, err := xz.NewWriter(destinationFile)
if err != nil {
	return err
}
defer xzWriter.Close()
if _, err := io.Copy(xzWriter, sourceFile); err != nil {
	return err
}
return destinationFile.Sync()
What do you think?
The -f option allows overwriting of the target file. Currently the target file is removed before the compression has been completed. So interrupting the process with CTRL-C removes the target file, which is unexpected.
If I am using the following code, how do I set the options for maximum compression (the -9 preset level)?
// target is of type io.Writer
xzw, _ := xz.NewWriter(target)
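For reference, a hedged sketch using WriterConfig instead; as far as I can tell the package exposes individual parameters rather than numbered presets, DictCap assumes the field in current master, and 64 MiB matches the dictionary size xz uses for -9:

// target is of type io.Writer
cfg := xz.WriterConfig{DictCap: 64 << 20} // 64 MiB, like the -9 preset's dictionary
xzw, err := cfg.NewWriter(target)
if err != nil {
	log.Fatal(err) // handle the error instead of discarding it
}
defer xzw.Close()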
IMO it's clear that this library is not affected by CVE-2024-3094, but that CVE is bouncing all over the internet, and it would be great to make a statement in the README about why it's not affected.
When I tried to unpack a big file (about 3 GiB in xz, about 18 GiB unpacked), the process was too slow: only 3 of the 18 GiB were unpacked in about 40 minutes on my machine. The same file unpacked in about 5 minutes using the 7-Zip tool.
$ ll
-rw-r--r-- 1 jpillora wheel 550K 30 Jan 21:18 a.log
$ cp a.log b.log
$ cp a.log c.log
$ xz a.log
$ gxz b.log
$ gzip c.log
$ ll
-rw-r--r-- 1 jpillora wheel 6.0K 30 Jan 15:43 a.log.xz
-rw-r--r-- 1 jpillora wheel 207K 30 Jan 21:16 b.log.xz
-rw-r--r-- 1 jpillora wheel 10K 30 Jan 21:16 c.log.gz
Any idea why this is?