Giter Site home page Giter Site logo

hashsplit-lib's Introduction

This is a library for splitting files into chunks which are fairly stable with respect to file changes, which 
allows those chunks to be used as the basis for efficient file syncronisation.


Latest build: 2.1.1
    http://milton.io/maven/hashsplit4j/hashsplit4j-lib/2.1.1/hashsplit4j-lib-2.1.1.jar

Maven:
The hashsplit4j jar is published to the milton.io maven repo. To use it add the 
dependency as normal, and add the milton repo..
     <dependencies>
        ...
        <dependency>
            <groupId>hashsplit4j</groupId>
            <artifactId>hashsplit4j-lib</artifactId>
            <version>2.1.1</version>
        </dependency>
    </dependencies>

    <repositories>
        <repository>
            <id>milton-repo</id>
            <url>http://milton.io/maven</url>
        </repository>
    </repositories>   

Efficiency:
By "efficient" I mean:
1. Efficient with respect to network bandwidth because only changed data needs to be syncronized
2. Efficient with respect to storage, since file versions can be persisted with only changes across versions being stored
3. Efficient with respect to CPU usage, because files can be easily reconstituted

This library is inspired by BUP - https://github.com/apenwarr/bup/blob/master/DESIGN

One major design choice, different to BUP, is that instead of implementing a tree of 
fanouts (as per the BUP design documentation) I've opted for a single level of chunk 
grouping. I think this makes it much simpler to parse a file, and also much simpler 
to use the result of the parse. I also think that a single level of grouping will 
be sufficient for minimising netwok traffic, provided we are more selective then 
BUP with grouping...

BUP looks for boundaries by matching the lowest 13 bits of the rolling checksum, 
and then groups those chunks when the lowest 17bits are set. This means that the 
grouping is on average every 15 chunks (4bits=15), which seems to be too granular. 
I think we would probably want to group sets of 256, ie matching 21 bits. Given 
that the chunks information ends up 0.25% of the total file size, this means that 
the chunk groups will be 1/256 * 0.25% = 0.001% of total file size, about 10k for 
a 1Gb file. So I don't think we need to be any more granular then that.

The main parsing class is Parser, which you use like this:
        InputStream in = Scratch.class.getResourceAsStream(fname);
        //InputStream in = Scratch.class.getResourceAsStream("test1.txt");
        MemoryHashStore hashStore = new MemoryHashStore();
        MemoryBlobStore blobStore = new MemoryBlobStore();
        Parser parser = new Parser();
        List<Long> megaCrcs = parser.parse(in, blobStore, hashStore);


Note that there are 2 abstraction layers for storing the parse result 
- HashStore: for storing the tree of blobs in groups which is the series of chunks
or blobs in the file
- BlobStore: for storing and retrieving the actual chunks/blobs

public interface HashStore {
    public void onChunk(long crc, int start, int finish, byte[] bytes);
    public void onFanout(long crc, List<Long> childCrcs);
    public List<Long> getCrcsFromFanout(Long fanoutCrc);
    public byte[] getBlob(Long crc);   
}

public interface BlobStore {    
    void setBlob(String hash, byte[] bytes);    
    byte[] getBlob(String hash);        
    boolean hasBlob(String hash);    
}


Files can be reconstituted with the Combiner class:
        Combiner combiner = new Combiner();
        MemoryHashStore hashStore = new MemoryHashStore();
        MemoryBlobStore blobStore = new MemoryBlobStore();
        ByteArrayOutputStream bout = new ByteArrayOutputStream();
        combiner.combine(megaCrcs, blobStore, hashStore, bout);

There is now a HttpBlobStore which can store blobs to a HTTP server. There is also
a module in the hashsplit4j project which provides such an HTTP server. Work is
currently under way (as of 1/1/2014) to implement efficient syncronisation between
these http servers (rsync is hell with terabytes of blobs)


hashsplit-lib's People

Contributors

dgks0n avatar

Watchers

James Cloos avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.