sourcefrog / conserve
🌲 Robust file backup tool in Rust
License: Other
At the moment you can't give a source directory: you must name all the files in order.
Maybe default to restore from the latest complete backup.
For the README.md: how does conserve compare with rdiff-backup or even rsync?
There is a Linux standard (where?) that cache directories contain a file with a certain name as a marker.
However, I'm not sure it's very commonly used or very important. Probably better to just exclude things by name.
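If this is the Cache Directory Tagging specification, the marker is a file named CACHEDIR.TAG whose first line is a fixed signature. A minimal Rust sketch of such a check (illustrative only, not something Conserve currently does):

use std::fs;
use std::path::Path;

// Signature line defined by the Cache Directory Tagging specification.
const CACHEDIR_SIGNATURE: &str = "Signature: 8a477f597d28d172789f06886806bc55";

// Return true if `dir` contains a CACHEDIR.TAG marker and could be skipped.
fn is_tagged_cache_dir(dir: &Path) -> bool {
    match fs::read_to_string(dir.join("CACHEDIR.TAG")) {
        Ok(contents) => contents.starts_with(CACHEDIR_SIGNATURE),
        Err(_) => false,
    }
}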
Extract files that differ from the source tree (or maybe between two versions) and run a command on them, or just report which ones differ.
Hi Martin,
I installed conserve per the instructions on Ubuntu 13.10. I did autogen, configure, make, make check (everything passed), and sudo make install. I then did conserve init /media/joey/joey-work/conserve and it created the directory with the binary file.
I then ran this:
joey@warthog:~$ conserve backup /home/joey/ /media/joey/joey-work/conserve/
F1121 12:19:39.968262 17188 bzdatawriter.cc:61] Check failed: bytes_read > 0 : Is a directory [21]
*** Check failure stack trace: ***
@ 0x7fd475d17daa (unknown)
@ 0x7fd475d17ce4 (unknown)
@ 0x7fd475d176e6 (unknown)
@ 0x7fd475d174fb (unknown)
@ 0x7fd475d18477 (unknown)
@ 0x41224e conserve::BzDataWriter::store_file()
@ 0x411aaa conserve::BlockWriter::add_file()
@ 0x407d49 conserve::cmd_backup()
@ 0x4125d0 conserve::run_command_line()
@ 0x406629 conserve::main()
@ 0x7fd475019de5 (unknown)
@ 0x4068ee (unknown)
@ (nil) (unknown)
Aborted (core dumped)
In the backup directory I have a folder called b0000 now. If I call that command again:
joey@warthog:~$ conserve backup /home/joey/ /media/joey/joey-work/conserve/
F1121 12:16:57.847147 17149 util.cc:46] Check failed: fd > 0 : File exists [17]
*** Check failure stack trace: ***
@ 0x7f867546fdaa (unknown)
@ 0x7f867546fce4 (unknown)
@ 0x7f867546f6e6 (unknown)
@ 0x7f867546f4fb (unknown)
@ 0x7f8675470477 (unknown)
@ 0x41544b conserve::write_proto_to_file()
@ 0x40911a conserve::BandWriter::start()
@ 0x406d75 conserve::Archive::start_band()
@ 0x407d06 conserve::cmd_backup()
@ 0x4125d0 conserve::run_command_line()
@ 0x406629 conserve::main()
@ 0x7f8674771de5 (unknown)
@ 0x4068ee (unknown)
@ (nil) (unknown)
Aborted (core dumped)
Joey
This would probably get clippy working and might work better with Cargo dependencies. A sub-crate might break a simple cargo install, though.
Without this, every backup stores everything, which is pretty inefficient.
Perhaps this is the same as having a diff command.
Do block compression on a thread pool, writing files as they complete.
I'm assuming here that compression is the expensive operation we want to parallelize, but depending on the observed balance of time possibly hashing could be parallelized too.
Since the index is sorted we need to accept completed blocks in order by the filename that includes them. So this is complementary to, and related to, storing content from multiple files in a single block. Although we know the hash in advance, we shouldn't write the index until the block data has been written.
Because blocks are of limited size we can keep them entirely in memory, and we can just transfer ownership of the block to the worker thread.
Run ~n_cpus compression worker threads.
One main thread reads files, hashes them, accumulates data into blocks. Push the blocks onto a queue for compression, along with a channel through which the worker can indicate that it's complete.
The compression worker thread returns a Result<()>.
Pseudocode:
pending_files = Deque()
for file in source:
    # Apply backpressure: if too many blocks are queued for compression,
    # wait for the oldest pending block to complete.
    while queue.length > N:
        pending_files[0].channels[0].wait()
    # Files whose blocks have all been written can be added to the index,
    # in the original sorted order.
    while pending_files and pending_files[0].channels.all_complete():
        index.add(pending_files.pop_front())
    # Split this file into blocks and queue each one for compression,
    # remembering a completion channel per block.
    file_completion_channels = []
    for block in pack_blocks_for_file(file):
        block_channel = channel()
        file_completion_channels.append(block_channel)
        queue.push(block.bytes, block_channel)
    pending_files.push_back((file, file_completion_channels))
# After the loop, wait for and index any remaining pending_files.
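A minimal Rust sketch of the worker side of this design, assuming std::sync::mpsc for the queue; BlockJob, spawn_compression_workers, and compress_block are illustrative names, with compress_block standing in for the real compress-and-write step:

use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// A block of file data queued for compression, plus a channel on which the
// worker reports completion back to the main (indexing) thread.
struct BlockJob {
    bytes: Vec<u8>,
    done: mpsc::Sender<std::io::Result<()>>,
}

// Spawn ~n_cpus workers that pull blocks off a shared queue, compress and
// write them, and signal completion. Returns the sending end of the queue.
fn spawn_compression_workers(n_workers: usize) -> mpsc::Sender<BlockJob> {
    let (tx, rx) = mpsc::channel::<BlockJob>();
    let rx = Arc::new(Mutex::new(rx));
    for _ in 0..n_workers {
        let rx = Arc::clone(&rx);
        thread::spawn(move || loop {
            // Take the next queued block; stop when the main thread drops the sender.
            let job = match rx.lock().unwrap().recv() {
                Ok(job) => job,
                Err(_) => break,
            };
            // Stand-in for the real compression + block-file write.
            let result = compress_block(&job.bytes);
            // Completion message back to the main thread.
            let _ = job.done.send(result);
        });
    }
    tx
}

// Placeholder for compressing and writing one block to the archive.
fn compress_block(_bytes: &[u8]) -> std::io::Result<()> {
    Ok(())
}

The main thread keeps the receiving ends of the per-block channels, and adds a file to the index only once all of its blocks have reported Ok(()).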
Do we really need band head and tail files? We assume we can list the directory. Do they store any new information as distinct from just the presence of data blocks?
The tail does let us more quickly tell that the band is complete: this could be in the final data block but that would be slower to read.
Do we need to do anything to detect a concurrent attempt to write the same band or block? Maybe easiest to just detect the written data block already exists.
Start a new file when:
Will require updating read code to handle multiple blocks.
To emit a log message we should erase the progress bar, print the message, then put the progress bar back. (Or maybe lazily put it back.)
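A minimal sketch, assuming the progress bar is a single line on stderr and redraw_progress is a hypothetical callback that repaints it:

// Erase the one-line progress bar, print the log message, then redraw the
// bar (or defer the redraw until the next progress update).
fn emit_log_line(msg: &str, redraw_progress: impl Fn()) {
    // "\r" returns to the start of the line; "\x1b[K" clears to end of line.
    eprint!("\r\x1b[K");
    eprintln!("{}", msg);
    redraw_progress();
}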
conserve validate exists and checks some invariants, but there are others that could usefully be checked.
This is somewhat handled by --exclude on restore, but a specific option to select the things you do want would be more straightforward.
We should easily know at least how many index blocks need to be restored, and from that we can get a rough completion percentage.
Because I really like the deduplication properties of Borg/Attic, I've been thinking about the data storage pool of conserve. Specifically, how the ability to only reference data in the parent backup limits the extent of deduplication possible.
Because I've thought about this for long enough, I'm dumping a mostly-formed idea here rather than endlessly iterating on it in 5-minute increments locally.
I think this preserves Conserve's append-only approach (modulo the file renaming involved) and human-recoverability. I also think it's safely lock-free (in the sense that multiple conserve operations can be safely performed in parallel on the same storage pool without relying on filesystem locking).
Filesystem renames are cheap
Filesystem renames cannot lose file data
Filesystem will handle large numbers of empty files reasonably efficiently
There's a rmdir() or equivalent that will only remove empty directories
Some basic filesystem ordering properties:
If a file a is created before a file b, then a process will never observe a directory listing containing b but not a.
To store a data block, say backup 0012 wants to store a block with SHA256 $HASH:
1. Create pool/$HASH if it does not already exist, and create an empty marker file pool/$HASH/0012.in-progress.
2. List the data-* files in pool/$HASH.
3. If a data-$GEN_NUMBER is found, rename() data-$GEN_NUMBER to data-($GEN_NUMBER + 1).
4. If no data-* file is found, write the block data to a temporary file, then rename it to data-0001.
5. Rename 0012.in-progress to 0012.
To delete backup 0070, Conserve starts with the list of blocks from the backup description. For each block $HASH:
1. List the contents of pool/$HASH.
2. Delete pool/$HASH/0070.
3. Rename or delete the data-* file in pool/$HASH, according to the data-* files present in the listing.
4. rmdir() pool/$HASH, which only removes it if it is now empty.
If there's a problem such as missing or corrupt data, optionally try to continue on the next thing.
ERROR: File exists (os error 17)
stack backtrace:
0: 0x100e351be - backtrace::backtrace::trace::h1b789d8e1542dad0
1: 0x100e358fc - backtrace::capture::Backtrace::new::h3d062d22ca4dc5a6
2: 0x100e34d36 - error_chain::make_backtrace::h48f950ecdd86ad84
3: 0x100e34ded - _$LT$error_chain..State$u20$as$u20$core..default..Default$GT$::default::he14c64e0c2fc5de7
4: 0x100e0a3bc - conserve::band::Band::create::h1d318051cfc1c04f
5: 0x100e07ca4 - conserve::archive::Archive::create_band::h342b7c2956161944
6: 0x100e088ba - conserve::backup::backup::hca5a5659f78774db
7: 0x100d9f81a - conserve::backup::h055dd3b66439ec1d
8: 0x100d9edd7 - conserve::main::h53c12c360c8dc5eb
9: 0x100e8c38a - __rust_maybe_catch_panic
10: 0x100e8b906 - std::rt::lang_start::ha9be7b379cf1665e
conserve delete -b b1234 /backup/home.c6
Unless forced
Might require a two-line progress bar...
Exclusive lock, permission denied etc.
Should keep a count of errors.
Just counting the number of files or maybe their total size will let us show percent completion for the whole backup. If the tree changes in between this preview and the actual backup, progress won't be absolutely accurate but that's fine.
This will take some time, so it should be optional.
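A minimal sketch of such a preview pass, assuming the walkdir crate; since the tree can change before the real backup runs, the counts are only an estimate:

use walkdir::WalkDir;

// Count files and total bytes under `root` so the backup can report
// percent completion. Unreadable entries are simply skipped here.
fn preview_tree(root: &std::path::Path) -> (u64, u64) {
    let mut n_files = 0u64;
    let mut total_bytes = 0u64;
    for entry in WalkDir::new(root).into_iter().filter_map(Result::ok) {
        if entry.file_type().is_file() {
            n_files += 1;
            if let Ok(meta) = entry.metadata() {
                total_bytes += meta.len();
            }
        }
    }
    (n_files, total_bytes)
}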
Nice, but:
A large fraction of data backed up probably will be fairly large and incompressible: jpgs, mpgs, previously-compressed archives, etc.
Trying to compress them will just waste CPU time on insertion and removal.
For files above our minimum granularity goal, combining them with others will only slow retrieval.
Storing them as objects with no compression conceivably makes recovery from a badly broken archive easier. However it also runs some risk that they might be directly seen and edited inside the archive, leading to corruption.
We probably don't want a separate band header for every such loose object, since the band heads will be very small and the headers too choppy.
Therefore perhaps we want multiple data files per band, which might be aggregates or might be single objects. We could reorder the small/compressible files to be inside the aggregate.
Maybe leave this until much later and just use FUSE for remote filesystems?
http://kentonv.github.io/capnproto
pros:
cons:
irrelevant:
Hi Martin,
Comparing Conserve (https://github.com/sourcefrog/conserve/wiki/Compared-to-others) against backintime, which is part of many Linux distros' repos, would be helpful. It's very fast (rsync fast) yet does a lot of things that rsync does not. I'm personally interested in this particular comparison because I use backintime today to do system backups, but I like your approach better.
Joey
Probably look at the extension or filename pattern (e.g. to handle git packs).
Might be simplest to compress them at level 0.
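A rough sketch of such an extension-based heuristic; the list of extensions is illustrative, not a fixed policy:

use std::path::Path;

// Extensions that are almost always already compressed; storing them at
// compression level 0 (or uncompressed) avoids wasted CPU.
const INCOMPRESSIBLE_EXTENSIONS: &[&str] = &[
    "jpg", "jpeg", "png", "gif", "mp3", "mp4", "mpg", "zip", "gz", "bz2", "xz", "7z", "pack",
];

fn probably_incompressible(path: &Path) -> bool {
    path.extension()
        .and_then(|e| e.to_str())
        .map(|e| INCOMPRESSIBLE_EXTENSIONS.contains(&e.to_ascii_lowercase().as_str()))
        .unwrap_or(false)
}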
Conserve currently only writes at about 1MB/s on a fast machine. Maybe Brotli is a bad choice here and we should try gzip.