Comments (10)

ruptotus avatar ruptotus commented on May 20, 2024 1

Hello,

I have a question about the "t" command. Is there some bug, or should I worry about my data?

My use case.

On Windows Server 2019 I have a DB2 database. I dump it daily and store the dump in a zpaq file using just the "a" command without any switches. On that machine I use version "zpaqfranz v57.4h-JIT-L (HW BLAKE3,SHA1),SFX64 v55.1, (12 Mar 2023)".

When I test on that server, all is OK:

PS C:\instalki\zpaq715> .\zpaqfranz.exe t C:\KOPIE\backup.zpaq
zpaqfranz v57.4h-JIT-L (HW BLAKE3,SHA1),SFX64 v55.1, (12 Mar 2023)
C:/KOPIE/backup.zpaq:
15 versions, 15 files, 608.189 frags, 2.948 blks, 5.733.362.462 bytes (5.34 GB)
To be checked 442.288.750.592 in 15 files (24 threads)
7.15 stage time     353.92 no error detected (RAM ~385.55 MB), try CRC-32 (if any)
Checking           512.717 blocks with CRC-32 (429.318.293.888 not-0 bytes)

CRC-32 time          49.52s
Blocks     429.318.293.888 (     512.717)
Zeros       12.970.456.704 (      22.497) 6.616000 s
Total      442.288.750.592 speed 8.929.173.492/sec (8.32 GB/s)
GOOD            : 00000015 of 00000015 (stored=decompressed)
VERDICT         : OK                   (CRC-32 stored vs decompressed)

403.656 seconds (000:06:43) (all OK)

Then I transfer the archive to another computer (I use the FileZilla resume option to download only new data).

The other computer is Windows 10 with zpaqfranz version "zpaqfranz v58.4s-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-06-23)".

And the test results are:

PS K:\dir> zpaqfranz.exe t .\backup.zpaq
zpaqfranz v58.4s-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-06-23)
./backup.zpaq:
15 versions, 15 files, 5.733.362.462 bytes (5.34 GB)
To be checked 442.288.750.592 in 15 files (4 threads)
7.15 stage time     202.50 no error detected (RAM ~64.26 MB), try CRC-32 (if any)
Checking           512.717 blocks with CRC-32 (429.318.293.888 not-0 bytes)
ERROR:  STORED CRC-32 77532235 != DECOMPRESSED C9E7F911 (ck 00019254) C:/KOPIE/DATABASE.0.DB2.DBPART000.20230324223027.001
ERROR:  STORED CRC-32 E3A7A2B1 != DECOMPRESSED B976C123 (ck 00021111) C:/KOPIE/DATABASE.0.DB2.DBPART000.20230331223032.001
ERROR:  STORED CRC-32 9AEAF621 != DECOMPRESSED 9849CEC6 (ck 00022809) C:/KOPIE/DATABASE.0.DB2.DBPART000.20230407223028.001
ERROR:  STORED CRC-32 55371556 != DECOMPRESSED DFE0B566 (ck 00024730) C:/KOPIE/DATABASE.0.DB2.DBPART000.20230416223027.001
ERROR:  STORED CRC-32 A09E6044 != DECOMPRESSED BB230EE2 (ck 00036567) C:/KOPIE/DATABASE.0.DB2.DBPART000.20230506223028.001
ERROR:  STORED CRC-32 A41EC1BA != DECOMPRESSED 93C71B3A (ck 00039680) C:/KOPIE/DATABASE.0.DB2.DBPART000.20230516223032.001
ERROR:  STORED CRC-32 49EA704B != DECOMPRESSED 33E65072 (ck 00040869) C:/KOPIE/DATABASE.0.DB2.DBPART000.20230521223034.001
ERROR:  STORED CRC-32 DDA93190 != DECOMPRESSED 3DD784C3 (ck 00041314) C:/KOPIE/DATABASE.0.DB2.DBPART000.20230526223034.001
ERROR:  STORED CRC-32 0B35FBDC != DECOMPRESSED 2DD981BB (ck 00040476) C:/KOPIE/DATABASE.0.DB2.DBPART000.20230604223041.001
ERROR:  STORED CRC-32 CB5A36E3 != DECOMPRESSED 2480F3C8 (ck 00042183) C:/KOPIE/DATABASE.0.DB2.DBPART000.20230609224237.001
ERROR:  STORED CRC-32 0082630B != DECOMPRESSED C2396AE7 (ck 00042683) C:/KOPIE/DATABASE.0.DB2.DBPART000.20230615223050.001
ERROR:  STORED CRC-32 E80A5C1D != DECOMPRESSED 2F7E5B91 (ck 00042703) C:/KOPIE/DATABASE.0.DB2.DBPART000.20230701223027.001
ERROR:  STORED CRC-32 AC3369F0 != DECOMPRESSED 47B5DD58 (ck 00043742) C:/KOPIE/DATABASE.0.DB2.DBPART000.20230709223032.001
ERROR:  STORED CRC-32 3DFB85F8 != DECOMPRESSED D0F06B3B (ck 00010772) C:/kopie/DATABASE.0.DB2.DBPART000.20230320223031.001
ERROR:  STORED CRC-32 2984D169 != DECOMPRESSED 8A141274 (ck 00043824) Z:/kopie/DATABASE.0.DB2.DBPART000.20230616233029.001

CRC-32 time         104.83s
Blocks     429.318.293.888 (     512.717)
Zeros             negative (      72.961) 72.663000 s
Total      150.007.756.078 speed 1.430.975.742/sec (1.33 GB/s)
ERRORS        : 00000015 (ERROR in rebuilded CRC-32, SHA-1 collisions?)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
WITH ERRORS

307.406 seconds (000:05:07) (with warnings)

But when I extract one file, for example "C:/KOPIE/DATABASE.0.DB2.DBPART000.20230709223032.001", and compute its CRC-32 manually with zpaqfranz, I get the good stored checksum:

PS C:\Users\user> zpaqfranz.exe sum D:\test\DATABASE.0.DB2.DBPART000.20230709223032.001 -crc32
zpaqfranz v58.4s-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-06-23)
franz:sum                                       1 - command
franz:-crc32
Getting CRC-32 ignoring .zfs and :$DATA

No multithread: Found (28.98 GB) => 31.112.585.216 bytes (28.98 GB) / 1 files in 0.015000
|CRC-32: AC3369F0 [     31.112.585.216]     |D:/test/DATABASE.0.DB2.DBPART000.20230709223032.001

214.296 seconds (000:03:34) (all OK)

The extracted file's hash AC3369F0 is equal to the stored hash from the "t" command.

I also calculated the SHA-256 checksum of the extracted dump file and of the original file on the server, and they are the same. So can I believe that the file stored in the zpaq archive is good?

PS1. While writing this comment I also downloaded the zpaqfranz exe version from the server, and its test is good:

PS C:\Users\user\Desktop> .\zpaqfranz.exe t K:\dir\backup.zpaq
zpaqfranz v57.4h-JIT-L (HW BLAKE3,SHA1),SFX64 v55.1, (12 Mar 2023)
K:/dir/backup.zpaq:
15 versions, 15 files, 608.189 frags, 2.948 blks, 5.733.362.462 bytes (5.34 GB)
To be checked 442.288.750.592 in 15 files (4 threads)
7.15 stage time     199.73 no error detected (RAM ~64.26 MB), try CRC-32 (if any)
Checking           512.717 blocks with CRC-32 (429.318.293.888 not-0 bytes)

CRC-32 time          32.25s
Blocks     429.318.293.888 (     512.717)
Zeros       12.970.456.704 (      22.497) 4.292000 s
Total      442.288.750.592 speed 13.707.154.386/sec (12.77 GB/s)
GOOD            : 00000015 of 00000015 (stored=decompressed)
VERDICT         : OK                   (CRC-32 stored vs decompressed)

232.032 seconds (000:03:52) (all OK)

So maybe there is some bug in the "t" command in newer versions? Or an incompatibility in the archive format?

PS2. Also, thank you for the fantastic job of continuing to develop zpaq. I used the original zpaq for years, and it was a nice find that someone continues the work ^_^

from zpaqfranz.

fcorbelli avatar fcorbelli commented on May 20, 2024

One of the main differences between zpaqfranz and zpaq is the existence of whole-file checksums (in fact there are even bigger differences for the new tar-like format, still to be completed).
In zpaq there is no checksum (or hash) of a whole file, only the SHA-1 of its individual fragments:
https://encode.su/threads/3508-Compute-overall-SHA1-from-a-SHA1-series-of-fragments

Sadly this means that SHA-1 collisions (there are famous PDF files in this respect) are NOT intercepted by zpaq.
Short version: if you archive two files with a SHA-1 collision, zpaq doesn't complain.
https://encode.su/threads/3658-How-big-can-the-hash-slowdown-in-an-archiver-be-tolerable/page2?highlight=sha1+collision
I can give a deeper explanation if you want.
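The failure mode can be sketched in a few lines. This is not zpaq's actual code, just a hypothetical content-addressed store keyed only on a digest, the way hash-based deduplication works; the digest is deliberately truncated to 1 byte so a collision is guaranteed in the demo (pigeonhole over 257 inputs), but with real SHA-1 the logic is identical, the collision is just astronomically rarer:

```python
import hashlib

def digest(fragment: bytes) -> bytes:
    # Truncated on purpose so the demo can force a collision.
    return hashlib.sha1(fragment).digest()[:1]

store: dict[bytes, bytes] = {}

def archive(fragment: bytes) -> bytes:
    key = digest(fragment)
    store.setdefault(key, fragment)  # dedup: colliding bytes are NOT stored
    return key

def extract(key: bytes) -> bytes:
    return store[key]

# Find two different fragments whose truncated digests collide.
seen: dict[bytes, bytes] = {}
a = b = None
for i in range(257):  # 257 fragments, only 256 possible 1-byte digests
    frag = b"fragment-%d" % i
    d = digest(frag)
    if d in seen:
        a, b = seen[d], frag
        break
    seen[d] = frag

assert a != b
ka, kb = archive(a), archive(b)
assert ka == kb          # the store believes the fragments are identical...
assert extract(kb) == a  # ...so b silently comes back as a's bytes
```

A checksum stored per whole file, computed with a different function, catches exactly this case at test time.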

CRC-32 has one major property that 'cryptographic' hashes (including MD5) lack: it can be computed over out-of-order portions (aka: fragments) and the results combined afterwards.
https://encode.su/threads/3658-How-big-can-the-hash-slowdown-in-an-archiver-be-tolerable?highlight=sha1+collision
Again, you will find all the details on the development forum, or I can write them here.
The short version is that zpaqfranz calculates the CRC-32 during the test phase (the t command) with minimal performance impact, and compares it with the CRC-32 calculated during the compression phase (in short, the one derived from the file as it was read).
https://encode.su/threads/3543-How-to-quickly-compute-CRC-32-of-an-all-zeroed-buffer
You really cannot do this "thing" with deduplication on (that's why there is the w command) with a different hasher (MD5 or whatever).
It is simply impossible, AND you cannot compute it multithreaded, only in a monotonic single-thread run (this is the p command).
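The "combine fragments" property can be sketched in Python. This is not zpaqfranz's code, just a translation of the classic zlib crc32_combine (GF(2) matrix trick): given the CRC-32 of two pieces and the length of the second, you get the CRC-32 of the concatenation without touching the data again, which is why per-fragment CRCs computed in any order (even in parallel) can be folded into a whole-file CRC:

```python
import zlib

POLY = 0xEDB88320  # reflected CRC-32 polynomial, as used by zlib

def gf2_matrix_times(mat, vec):
    """Multiply a 32x32 GF(2) matrix (list of column ints) by a vector."""
    total = 0
    i = 0
    while vec:
        if vec & 1:
            total ^= mat[i]
        vec >>= 1
        i += 1
    return total

def gf2_matrix_square(square, mat):
    for n in range(32):
        square[n] = gf2_matrix_times(mat, mat[n])

def crc32_combine(crc1, crc2, len2):
    """CRC-32 of A+B given crc32(A), crc32(B) and len(B), without the data."""
    if len2 == 0:
        return crc1
    even, odd = [0] * 32, [0] * 32
    odd[0] = POLY                 # operator advancing the CRC by one zero bit
    for n in range(1, 32):
        odd[n] = 1 << (n - 1)
    gf2_matrix_square(even, odd)  # two zero bits
    gf2_matrix_square(odd, even)  # four zero bits
    while True:                   # apply len2 zero *bytes* to crc1
        gf2_matrix_square(even, odd)
        if len2 & 1:
            crc1 = gf2_matrix_times(even, crc1)
        len2 >>= 1
        if not len2:
            break
        gf2_matrix_square(odd, even)
        if len2 & 1:
            crc1 = gf2_matrix_times(odd, crc1)
        len2 >>= 1
        if not len2:
            break
    return crc1 ^ crc2

# Fragments hashed independently, then folded without re-reading the data:
a, b, c = b"hello ", b"cruel ", b"world"
whole = crc32_combine(crc32_combine(zlib.crc32(a), zlib.crc32(b), len(b)),
                      zlib.crc32(c), len(c))
assert whole == zlib.crc32(a + b + c)
```

No such combine operation exists for MD5 or SHA-1: their internal state depends on everything hashed so far, so out-of-order, deduplicated fragments cannot be merged into a whole-file digest.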

The net result is that SHA-1 collisions are detected by zpaqfranz (detected, to be precise, not corrected).
"Hidden" changes inside the data will be detected too (e.g. archiving an in-use file that "someone" changes meanwhile, like a running VM).
This is because zpaq(franz) tries to archive practically everything it can, whether it is in use or not (after all, it is software designed for backup, unlike other compressors).
Obviously the probability is small, but it still exists.
Example:

P:\vm>zpaqfranz t sift.zpaq
zpaqfranz v58.4f-JIT-LBLAKE3,SFX64 v55.1,(2023-05-21)
sift.zpaq:
1 versions, 9 files, 206.441 frags, 919 blks, 5.274.750.028 bytes (4.91 GB)
To be checked 17.286.397.757 in 8 files (32 threads)
7.15 stage time      21.38 no error detected (RAM ~514.07 MB), try CRC-32 (if any)
Checking            18.315 blocks with CRC-32 (16.713.549.149 not-0 bytes)
ERROR:  STORED CRC-32 2379B9B9 != DECOMPRESSED B67F2325 (ck 00008946) sift/SIFT-Workstation-disk1.vmdk

CRC-32 time           0.45s
Blocks      16.713.549.149 (      18.315)
Zeros             negative (       2.352) 0.156000 s
Total       14.439.568.073 speed 31.735.314.446/sec (29.56 GB/s)
ERRORS        : 00000001 (ERROR in rebuilded CRC-32, SHA-1 collisions?)
--------------------------------------------------------------------------------------
GOOD            : 00000007 of 00000008 (stored=decompressed)
WITH ERRORS

21.875 seconds (000:00:21) (with warnings)

7.15 stage time 21.38 no error detected (RAM ~514.07 MB), try CRC-32 (if any)
This is the first stage, the zpaq-based test.
After that zpaqfranz's own check kicks in (if any).
In the above example the archive is good (it is extractable), but the archived data is somewhat different from what was on disk.

In this case everything is OK:

P:\backup>zpaqfranz t www.zpaq
zpaqfranz v58.4f-JIT-LBLAKE3,SFX64 v55.1,(2023-05-21)
www.zpaq:
1 versions, 2.731 files, 106.338 frags, 523 blks, 6.966.672.085 bytes (6.49 GB)
To be checked 7.378.295.954 in 2.461 files (32 threads)
7.15 stage time      25.17 no error detected (RAM ~514.07 MB), try CRC-32 (if any)
Checking             3.487 blocks with CRC-32 (7.378.213.268 not-0 bytes)
Block 00002K          6.11 GB
CRC-32 time           0.05s
Blocks       7.378.213.268 (       3.487)
Zeros               82.686 (           1) 0.000000 s
Total        7.378.295.954 speed 153.714.499.041/sec (143.16 GB/s)
GOOD            : 00002461 of 00002461 (stored=decompressed)
VERDICT         : OK                   (CRC-32 stored vs decompressed)

Taking the CRC-32 too slows down the archiving stage and makes a bigger archive.
It is possible to turn it off, getting a "straight" zpaq-style archive, with -nochecksum.

Since data reliability is more important to me, I use it as the default.
https://encode.su/threads/3658-How-big-can-the-hash-slowdown-in-an-archiver-be-tolerable
So, by default, you get THREE different tests:

  1. SHA-1 of fragments
  2. CRC-32 of the whole file
  3. XXHASH64 of the whole file (or a different one, using for example -blake3 or whatever)
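The three layers can be sketched as a toy model (this is not the actual zpaqfranz archive format; BLAKE2 from Python's standard library stands in for XXHASH64, which is not in the stdlib, and the fragment size is arbitrary):

```python
import hashlib
import zlib

FRAG = 64 * 1024  # hypothetical fragment size

def index_file(data: bytes) -> dict:
    """Toy model of the three checks: per-fragment SHA-1 (used for dedup),
    whole-file CRC-32, and a whole-file fast hash."""
    frags = [data[i:i + FRAG] for i in range(0, len(data), FRAG)]
    return {
        "frag_sha1": [hashlib.sha1(f).hexdigest() for f in frags],
        "file_crc32": zlib.crc32(data) & 0xFFFFFFFF,
        # blake2b stands in for the XXHASH64 default here.
        "file_fast": hashlib.blake2b(data, digest_size=8).hexdigest(),
    }

def verify(data: bytes, idx: dict) -> bool:
    return index_file(data) == idx

data = b"x" * 200_000
idx = index_file(data)
assert verify(data, idx)
assert not verify(data[:-1] + b"y", idx)  # any single-byte change is caught
```

The point of the layering: the fragment hashes drive deduplication, while the two independent whole-file digests catch anything the fragment layer misses, including a fragment-hash collision.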

PS -pakka changes only the output; it is an interface for the Windows GUI. Essentially it writes less information.

fcorbelli avatar fcorbelli commented on May 20, 2024

PS now it is 01:35, later I will fix and explain better, time to... bed :)

fcorbelli avatar fcorbelli commented on May 20, 2024

You can find it here, listed as the very first difference:

https://github.com/fcorbelli/zpaqfranz/wiki/Diff-against-7.15:-add

todd-a-jacobs avatar todd-a-jacobs commented on May 20, 2024

BTW, I found that Apple Silicon (notably the M2 processor) seems to be hardware-optimized for SHA-256. When I ran the zpaqfranz benchmarks, even against a terabyte or two, SHA-256 performed at about the same speed as XXHASH3. It might make sense to check for hardware acceleration and use SHA-256 as the default instead of XXHASH3 when the performance is going to be roughly the same, since SHA-256 is cryptographically strong while the various XXHASH algorithms have no cryptographic properties at all.

Since I don't know how the benchmarks are done, this may not actually be representative of real-world speeds. Still, it's at least worth thinking about, since a number of other platforms also now include AES hardware to speed up cipher operations.
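One way to sanity-check such a claim on any given machine is a rough hashlib micro-benchmark (this is not how zpaqfranz's b command measures; the buffer size and repetition count here are arbitrary):

```python
import hashlib
import time

def mb_per_s(name: str, data: bytes, reps: int = 20) -> float:
    """Rough throughput of one hashlib algorithm in MB/s."""
    t0 = time.perf_counter()
    for _ in range(reps):
        h = hashlib.new(name)
        h.update(data)
        h.digest()
    dt = time.perf_counter() - t0
    return len(data) * reps / dt / 1e6

buf = b"\0" * (8 * 1024 * 1024)  # 8 MB test buffer
for name in ("sha256", "sha1", "blake2b", "md5"):
    print(f"{name:8s} {mb_per_s(name, buf):8.0f} MB/s")
```

On CPUs with SHA extensions (or Apple Silicon), sha256 can indeed land near the fast non-cryptographic hashes; on older x86 it usually trails badly.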

fcorbelli avatar fcorbelli commented on May 20, 2024

You can see "under the hood" with a

zpaqfranz b -debug

(...)
Free RAM seems 43.218.018.304
1838: new ecx 2130194955
1843: new ebx 563910569
SSSE3 :OK
SSE41 :OK
SHA   :OK
SHA1/2 seems supported by CPU

You need all 3 "OK"s to "automagically" get HW acceleration.
Rare, very rare, with Intel CPUs.
Sadly I do not like Macs very much (almost always... I use the terminal just like on a FreeBSD box :)

The benchmark is very, very crude, just a quick check to get some info on VPS CPUs.
I see your point, but the default is XXHASH (64 bit), not XXH3 (128 bit), so as not to "cook off" 32-bit CPUs (not every silicon is 64 bit).
With zpaqfranz you can choose... whatever you like (almost everywhere, with some exceptions for MD5).

fcorbelli avatar fcorbelli commented on May 20, 2024

PS this is a "real world" example from an Intel-based server, with Proxmox + a FreeBSD VM, running on HDDs

root@franco:/home/mog1/copie # zpaqfranz versum "./*.zpaq" -checktxt
zpaqfranz v58.4q-JIT-L,HW SHA1/2,(2023-06-22)
franz:versum                                    | - command
franz:-checktxt
66265: Test MD5 hashes of .zpaq against _md5.txt
66136: Searching for jolly archive(s) in <<./*.zpaq>> for extension <<zpaq>>
66288: Bytes to be checked 250.899.885.678 (233.67 GB) in files 2
009% 00:29:23 (  22.19 GB) of ( 233.67 GB)          128.799.140/SeC

As you can see, the "real" bandwidth of the drive is about 128 MB/s; even a 10 GB/s hasher would gain nothing.

fcorbelli avatar fcorbelli commented on May 20, 2024

It is a known bug, for file sizes (in decimal) longer than 10 characters.
You can get the latest nightly build from http://www.francocorbelli.it/zpaqfranz with the bug fixed, and with the new -fasttxt magic computation of the full-archive CRC-32.

fcorbelli avatar fcorbelli commented on May 20, 2024

https://github.com/fcorbelli/zpaqfranz/releases/tag/58.5

ruptotus avatar ruptotus commented on May 20, 2024

Thank you for the quick response and... for a fixed release already ^_^
