hashsplit4j-lib

This is a library for splitting files into chunks which are fairly stable with respect to file changes,
allowing those chunks to be used as the basis for efficient file synchronisation.


Latest build: 2.4.1
    http://milton.io/maven/hashsplit4j/hashsplit4j-lib/2.4.1/hashsplit4j-lib-2.4.1.jar

Maven:
The hashsplit4j jar is published to the milton.io maven repo. To use it, add the
dependency as normal, and add the milton repo:
    <dependencies>
        ...
        <dependency>
            <groupId>hashsplit4j</groupId>
            <artifactId>hashsplit4j-lib</artifactId>
            <version>2.4.1</version>
        </dependency>
    </dependencies>

    <repositories>
        <repository>
            <id>milton-repo</id>
            <url>http://milton.io/maven</url>
        </repository>
    </repositories>

Efficiency:
By "efficient" I mean:
1. Efficient with respect to network bandwidth, because only changed data needs to be synchronised
2. Efficient with respect to storage, since file versions can be persisted with only the changes between versions being stored
3. Efficient with respect to CPU usage, because files can be easily reconstituted

This library is inspired by BUP - https://github.com/apenwarr/bup/blob/master/DESIGN

One major design choice, different from BUP, is that instead of implementing a tree of
fanouts (as per the BUP design documentation) I've opted for a single level of chunk
grouping. I think this makes it much simpler to parse a file, and also much simpler
to use the result of the parse. I also think that a single level of grouping will
be sufficient for minimising network traffic, provided we are more selective than
BUP with grouping.

BUP looks for chunk boundaries by matching the lowest 13 bits of the rolling checksum,
and then groups those chunks when the lowest 17 bits are set. This means that
grouping happens on average every 16 chunks (4 extra bits, 2^4 = 16), which seems
too granular. I think we would probably want to group sets of 256 chunks, i.e.
matching 21 bits (2^8 = 256). Given that the chunk information ends up around 0.25% of
the total file size, this means that the chunk groups will be 1/256 * 0.25% = 0.001% of
total file size, about 10KB for a 1GB file. So I don't think we need to be any more
granular than that.
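
To make the boundary test concrete, here is a minimal sketch (the class and method
names are illustrative, not the library's code, and a plain additive checksum stands
in for BUP's Adler-style rollsum). A chunk boundary fires when the low 13 bits of the
rolling checksum are all set, and a group boundary when the low 21 bits are:

public class BoundaryTest {

    static final int CHUNK_MASK = (1 << 13) - 1; // chunk boundary: on average every 2^13 = 8KB of input
    static final int GROUP_MASK = (1 << 21) - 1; // group boundary: on average every 2^8 = 256 chunks

    // roll the window forward one byte: add the incoming byte, drop the outgoing
    // one (a plain sum keeps the sketch short; BUP uses a two-part sum)
    static int roll(int checksum, byte in, byte out) {
        return checksum + (in & 0xFF) - (out & 0xFF);
    }

    static boolean isChunkBoundary(int checksum) {
        return (checksum & CHUNK_MASK) == CHUNK_MASK;
    }

    static boolean isGroupBoundary(int checksum) {
        return (checksum & GROUP_MASK) == GROUP_MASK;
    }
}

Note that a group boundary is automatically also a chunk boundary, since the low
21 bits being set implies the low 13 are set.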

The main parsing class is Parser, which you use like this:

        // parse the stream: chunks go to the blob store, and the grouping
        // structure (fanouts) goes to the hash store
        InputStream in = Scratch.class.getResourceAsStream(fname);
        MemoryHashStore hashStore = new MemoryHashStore();
        MemoryBlobStore blobStore = new MemoryBlobStore();
        Parser parser = new Parser();
        List<Long> megaCrcs = parser.parse(in, blobStore, hashStore);


The returned megaCrcs identify the chunk groups (fanouts) that together make up
the file. Note that there are two abstraction layers for storing the parse result:
- HashStore: stores the grouping structure, i.e. the series of chunk CRCs making
up each group (fanout)
- BlobStore: stores and retrieves the actual chunks/blobs

public interface HashStore {
    // notified by the parser for each chunk found
    public void onChunk(long crc, int start, int finish, byte[] bytes);
    // notified by the parser for each group (fanout) of chunk CRCs
    public void onFanout(long crc, List<Long> childCrcs);
    // returns the chunk CRCs belonging to the given group
    public List<Long> getCrcsFromFanout(Long fanoutCrc);
    public byte[] getBlob(Long crc);
}

public interface BlobStore {
    // store a chunk under its hash
    void setBlob(String hash, byte[] bytes);
    byte[] getBlob(String hash);
    boolean hasBlob(String hash);
}
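
For illustration, a file could be reconstituted by hand from these two stores.
This sketch is not part of the library, and it assumes blobs are keyed in the
BlobStore by the hex string of their CRC; the real keying convention is whatever
the Parser used:

        // walk each group, then each chunk within it, concatenating the chunks
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Long megaCrc : megaCrcs) {
            for (Long chunkCrc : hashStore.getCrcsFromFanout(megaCrc)) {
                // the hex-of-CRC key below is an assumption made for this sketch
                byte[] blob = blobStore.getBlob(Long.toHexString(chunkCrc));
                out.write(blob, 0, blob.length);
            }
        }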


In practice, files can be reconstituted with the Combiner class, using the same
stores that were populated during parsing:

        // blobStore and hashStore must contain the parsed data, e.g. the
        // stores populated by parser.parse above
        Combiner combiner = new Combiner();
        ByteArrayOutputStream bout = new ByteArrayOutputStream();
        combiner.combine(megaCrcs, blobStore, hashStore, bout);

There is now an HttpBlobStore which can store blobs on an HTTP server. There is also
a module in the hashsplit4j project which provides such an HTTP server. Work is
currently under way (as of 1/1/2014) to implement efficient synchronisation between
these HTTP servers (rsync is hell with terabytes of blobs).
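
To give a feel for the idea, here is a minimal HTTP-backed implementation of the
BlobStore interface above. This is a sketch only, not the library's HttpBlobStore:
the class name, constructor, and the one-URL-per-hash PUT/GET/HEAD layout are all
assumptions made for illustration.

import java.io.*;
import java.net.HttpURLConnection;
import java.net.URL;

public class SimpleHttpBlobStore implements BlobStore {

    private final String baseUrl; // e.g. "http://myserver.com:8080/blobs/" (hypothetical)

    public SimpleHttpBlobStore(String baseUrl) {
        this.baseUrl = baseUrl;
    }

    @Override
    public void setBlob(String hash, byte[] bytes) {
        try {
            // PUT the blob body to a URL derived from its hash (assumed layout)
            HttpURLConnection con = open(hash, "PUT");
            con.setDoOutput(true);
            try (OutputStream out = con.getOutputStream()) {
                out.write(bytes);
            }
            if (con.getResponseCode() >= 300) {
                throw new IOException("PUT failed: " + con.getResponseCode());
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public byte[] getBlob(String hash) {
        try {
            HttpURLConnection con = open(hash, "GET");
            if (con.getResponseCode() == 404) {
                return null; // blob not present on the server
            }
            try (InputStream in = con.getInputStream();
                 ByteArrayOutputStream out = new ByteArrayOutputStream()) {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                return out.toByteArray();
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public boolean hasBlob(String hash) {
        try {
            // HEAD avoids transferring the blob just to test for existence
            return open(hash, "HEAD").getResponseCode() == 200;
        } catch (IOException e) {
            return false;
        }
    }

    private HttpURLConnection open(String hash, String method) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(baseUrl + hash).openConnection();
        con.setRequestMethod(method);
        return con;
    }
}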


Contributors: bradmac, dgks0n, tuzzmaniandevil


Open issue: TCP Client/Server for efficient remote sync

I would like to do remote synchronisation as follows (see the sketch after this list):

  • the client initiates a TCP connection to a storage server, and sends the hash of the root representing the current local state of resources
  • the server begins walking the directory tree from this hash
  • it uses its server-side hash and blob stores as primary
  • where a hash is encountered which is not present on the server side, it requests it from the client over the open TCP connection
  • the client responds with the requested data, either as blobs or hash fanouts
  • these are persisted by the server to the server-side blob and hash stores
  • once the server has successfully walked the entire tree, it sets the new hash as the next commit of the branch
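
A sketch of the server-side walk described above. Nothing here exists in the
library yet; fetchFromClient and childHashes are hypothetical names for requesting
data over the open connection and for reading a fanout's children:

    // hypothetical server-side walk, following the steps in the list above
    void walk(String hash) {
        if (!blobStore.hasBlob(hash)) {
            byte[] data = fetchFromClient(hash); // request missing data from the client (hypothetical)
            blobStore.setBlob(hash, data);
        }
        // childHashes (hypothetical) returns the children of a fanout,
        // and an empty list for leaf blobs
        for (String child : childHashes(hash)) {
            walk(child);
        }
    }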

The API should look as follows:

  1. Server side stores
  2. Server side
    SyncServer server = new SyncServer(); // default to all interfaces and some default port
    SyncListener listener = new SyncListener() {
        public void onSync(String hash, BlobStore blobStore, HashStore hashStore) {
            // the app can walk the tree here
        }
    };
    server.start(listener); // the server is now listening for incoming TCP connections
  3. Client side
    SyncClient client = new SyncClient("myserver.com", 9090, blobStore, hashStore); // server address as above, with local stores
    client.sync("abc123"); // provide a real SHA1 hash
