dgks0n / hashsplit-lib Goto Github PK
View Code? Open in Web Editor NEWThis project forked from hashsplit4j/hashsplit-lib
Implementation of the BUP-like hashsplitting algorithm in hava
This project forked from hashsplit4j/hashsplit-lib
Implementation of the BUP-like hashsplitting algorithm in hava
This is a library for splitting files into chunks which are fairly stable with respect to file changes, which allows those chunks to be used as the basis for efficient file syncronisation. Latest build: 2.1.1 http://milton.io/maven/hashsplit4j/hashsplit4j-lib/2.1.1/hashsplit4j-lib-2.1.1.jar Maven: The hashsplit4j jar is published to the milton.io maven repo. To use it add the dependency as normal, and add the milton repo.. <dependencies> ... <dependency> <groupId>hashsplit4j</groupId> <artifactId>hashsplit4j-lib</artifactId> <version>2.1.1</version> </dependency> </dependencies> <repositories> <repository> <id>milton-repo</id> <url>http://milton.io/maven</url> </repository> </repositories> Efficiency: By "efficient" I mean: 1. Efficient with respect to network bandwidth because only changed data needs to be syncronized 2. Efficient with respect to storage, since file versions can be persisted with only changes across versions being stored 3. Efficient with respect to CPU usage, because files can be easily reconstituted This library is inspired by BUP - https://github.com/apenwarr/bup/blob/master/DESIGN One major design choice, different to BUP, is that instead of implementing a tree of fanouts (as per the BUP design documentation) I've opted for a single level of chunk grouping. I think this makes it much simpler to parse a file, and also much simpler to use the result of the parse. I also think that a single level of grouping will be sufficient for minimising netwok traffic, provided we are more selective then BUP with grouping... BUP looks for boundaries by matching the lowest 13 bits of the rolling checksum, and then groups those chunks when the lowest 17bits are set. This means that the grouping is on average every 15 chunks (4bits=15), which seems to be too granular. I think we would probably want to group sets of 256, ie matching 21 bits. Given that the chunks information ends up 0.25% of the total file size, this means that the chunk groups will be 1/256 * 0.25% = 0.001% of total file size, about 10k for a 1Gb file. So I don't think we need to be any more granular then that. The main parsing class is Parser, which you use like this: InputStream in = Scratch.class.getResourceAsStream(fname); //InputStream in = Scratch.class.getResourceAsStream("test1.txt"); MemoryHashStore hashStore = new MemoryHashStore(); MemoryBlobStore blobStore = new MemoryBlobStore(); Parser parser = new Parser(); List<Long> megaCrcs = parser.parse(in, blobStore, hashStore); Note that there are 2 abstraction layers for storing the parse result - HashStore: for storing the tree of blobs in groups which is the series of chunks or blobs in the file - BlobStore: for storing and retrieving the actual chunks/blobs public interface HashStore { public void onChunk(long crc, int start, int finish, byte[] bytes); public void onFanout(long crc, List<Long> childCrcs); public List<Long> getCrcsFromFanout(Long fanoutCrc); public byte[] getBlob(Long crc); } public interface BlobStore { void setBlob(String hash, byte[] bytes); byte[] getBlob(String hash); boolean hasBlob(String hash); } Files can be reconstituted with the Combiner class: Combiner combiner = new Combiner(); MemoryHashStore hashStore = new MemoryHashStore(); MemoryBlobStore blobStore = new MemoryBlobStore(); ByteArrayOutputStream bout = new ByteArrayOutputStream(); combiner.combine(megaCrcs, blobStore, hashStore, bout); There is now a HttpBlobStore which can store blobs to a HTTP server. There is also a module in the hashsplit4j project which provides such an HTTP server. Work is currently under way (as of 1/1/2014) to implement efficient syncronisation between these http servers (rsync is hell with terabytes of blobs)
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.