fs-make-simple / undup

This project forked from radii/undup

store fewer bytes thanks to backreferences

License: GNU General Public License v2.0

undup - compress files by consolidating duplicate data

undup compresses an input stream by watching for blocks that have
appeared earlier in the stream and replacing each duplicate with a
backreference.  Integrity is ensured by validating a SHA-256 hash
computed over the entire stream at reconstruction time.
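
The idea can be illustrated with a minimal Python sketch.  This is not
undup's implementation or stream format: the fixed 4096-byte block size,
the in-memory token list, and the names dedup and redup are all
assumptions made for illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for illustration; not taken from undup


def dedup(data: bytes, block_size: int = BLOCK_SIZE):
    """Split data into fixed-size blocks and replace repeats with backreferences.

    Returns (tokens, digest). Each token is ("lit", block_bytes) or
    ("ref", index_of_earlier_lit_token). digest is the SHA-256 of all of data.
    """
    seen = {}    # block hash -> index of the "lit" token holding that block
    tokens = []
    for off in range(0, len(data), block_size):
        block = data[off:off + block_size]
        key = hashlib.sha256(block).digest()
        if key in seen:
            tokens.append(("ref", seen[key]))   # duplicate: store a backreference
        else:
            seen[key] = len(tokens)
            tokens.append(("lit", block))       # first occurrence: store the bytes
    return tokens, hashlib.sha256(data).hexdigest()


def redup(tokens, expected_digest: str) -> bytes:
    """Rebuild the stream and verify its SHA-256 against expected_digest."""
    out = bytearray()
    h = hashlib.sha256()
    for kind, value in tokens:
        block = value if kind == "lit" else tokens[value][1]
        out += block
        h.update(block)
    if h.hexdigest() != expected_digest:
        raise ValueError("SHA-256 mismatch: reconstructed stream is corrupt")
    return bytes(out)
```

Because the hash covers the whole reconstructed stream, a backreference
that resolves to the wrong block is caught at the end of reconstruction
rather than silently producing bad output.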

undup is intended to be pipelined with a general-purpose compressor such as
gzip, bzip2, or xz.

USAGE
-----

tar cf - dir | undup | xz > dir.tar.undup.xz
xzcat dir.tar.undup.xz | undup -d | tar xvf -

SAMPLE RESULTS
--------------

% for r in 3.0 3.1 3.2 3.3-rc1; do
    git archive --format=tar --prefix=linux-$r/ v$r | tar -C /tmp/linuxes -xf -
done
% tar -C /tmp -cf linuxes.tar linuxes
% du -shc /tmp/linuxes/*
500M    /tmp/linuxes/linux-3.0
504M    /tmp/linuxes/linux-3.1
511M    /tmp/linuxes/linux-3.2
518M    /tmp/linuxes/linux-3.3-rc1
2.0G    total

File sizes:

1833635840   linuxes.tar
 937173504   linuxes.tar.undp
 404399664   linuxes.tar.gz
 316914845   linuxes.tar.bz2
 270460412   linuxes.tar.xz
 203023371   linuxes.tar.undp.gz
 167099750   linuxes.tar.lrz
 159673153   linuxes.tar.undp.bz2
 138929420   linuxes.tar.undp.xz


format   ratio    pipelined w/ undup
------   -----    ------------------
undp      1.95
gzip      4.53       9.03
bzip2     5.78      11.48
xz        6.78      13.19
lrzip    10.97
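
Each ratio above is the size of linuxes.tar divided by the size of the
compressed file.  The following Python sketch recomputes them from the
file sizes listed earlier; the helper name ratio is hypothetical.  The
computed values agree with the table to within 0.01.

```python
TAR_SIZE = 1833635840  # linuxes.tar, in bytes

# Compressed sizes in bytes, copied from the "File sizes" listing.
sizes = {
    "linuxes.tar.undp": 937173504,
    "linuxes.tar.gz": 404399664,
    "linuxes.tar.bz2": 316914845,
    "linuxes.tar.xz": 270460412,
    "linuxes.tar.undp.gz": 203023371,
    "linuxes.tar.lrz": 167099750,
    "linuxes.tar.undp.bz2": 159673153,
    "linuxes.tar.undp.xz": 138929420,
}


def ratio(compressed_size: int, original_size: int = TAR_SIZE) -> float:
    """Compression ratio: original size divided by compressed size."""
    return original_size / compressed_size


for name, size in sizes.items():
    print(f"{name:<22} {ratio(size):6.2f}")
```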

Timings for undup + compressors on a Core i7 L 640 @ 2.13 GHz (2.9 GHz Turbo):

First, we time the undup phase.  It consumes a significant amount of
memory (for undup 0.2, about 105 MB of RAM to store hashes for the
1.8 GB linuxes.tar).  The phases can be pipelined, but to get the most
reproducible timing results we ran each phase separately.

undup linuxes.tar 47.26s user 4.15s system 97% cpu 52.885 total

Second, we compare times for various compressors to compress
linuxes.tar.undp.

gzip   35.81s user 0.72s system 96% cpu 37.817 total
bzip2 117.79s user 0.45s system 99% cpu 1:58.66 total
xz    606.51s user 1.31s system 99% cpu 10:09.72 total

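undup + bzip2 achieves an 11.48x compression ratio while consuming only
about 165 seconds of user CPU time when run separately (47.26s for undup
plus 117.79s for bzip2).  Run as a pipeline, elapsed time (about 3:15) is
reasonably close to the 2:52 the two phases take back to back: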

undup 59.64s user 3.93s system 32% cpu 3:14.76 total
bzip2 138.65s user 1.05s system 71% cpu 3:14.73 total

This compares favorably to lrzip 0.608, which achieves a 10.97x ratio after
consuming 913 seconds of CPU time (lrzip is multithreaded by default):

lrzip -v -w 10 linuxes.tar 913.08s user 14.99s system 298% cpu 5:10.78 total
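
The CPU-time comparison can be reproduced from the timing lines above.
This Python sketch assumes the "<N>s user <N>s system" layout shown in
those lines; the regular expression and the helper name cpu_seconds are
illustrative only.

```python
import re

# Matches the "<user>s user <system>s system" part of a shell `time` line.
TIME_RE = re.compile(r"(?P<user>[\d.]+)s user (?P<system>[\d.]+)s system")


def cpu_seconds(line: str):
    """Return (user, system) CPU seconds parsed from one timing line."""
    m = TIME_RE.search(line)
    if m is None:
        raise ValueError(f"not a timing line: {line!r}")
    return float(m["user"]), float(m["system"])


# Timing lines copied from the measurements above (phases run separately).
undup_user, _ = cpu_seconds(
    "undup linuxes.tar 47.26s user 4.15s system 97% cpu 52.885 total")
bzip2_user, _ = cpu_seconds(
    "bzip2 117.79s user 0.45s system 99% cpu 1:58.66 total")
lrzip_user, _ = cpu_seconds(
    "lrzip -v -w 10 linuxes.tar 913.08s user 14.99s system 298% cpu 5:10.78 total")

combined = undup_user + bzip2_user
print(f"undup + bzip2: {combined:.2f}s user")
print(f"lrzip:         {lrzip_user:.2f}s user")
print(f"lrzip uses about {lrzip_user / combined:.1f}x the user CPU time")
```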
