sourcefrog / conserve
🌲 Robust file backup tool in Rust
License: Other
At the moment you can't give a source directory: you must name all the files in order.
Maybe default to restore from the latest complete backup.
For the README.md: how does conserve compare with rdiff-backup or even rsync?
There is a Linux standard (where?) that cache directories contain a file with a certain name as a marker.
However, I'm not sure it's very commonly used or very important. Probably better to just exclude things by name.
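If this is the Cache Directory Tagging specification, the marker is a file named CACHEDIR.TAG whose first line is a fixed signature. A minimal Rust sketch of such a check (illustrative only, not something Conserve currently does):

use std::fs;
use std::path::Path;

// Signature line defined by the Cache Directory Tagging specification.
const CACHEDIR_SIGNATURE: &str = "Signature: 8a477f597d28d172789f06886806bc55";

// Return true if `dir` contains a CACHEDIR.TAG marker and could be skipped.
fn is_tagged_cache_dir(dir: &Path) -> bool {
    match fs::read_to_string(dir.join("CACHEDIR.TAG")) {
        Ok(contents) => contents.starts_with(CACHEDIR_SIGNATURE),
        Err(_) => false,
    }
}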
Extract files that differ from the source tree (or maybe between two versions) and run a command on them, or just report which ones differ.
Hi Martin,
I installed conserve per the instructions on Ubuntu 13.10. I did autogen, configure, make, make check (everything passed), and sudo make install. I then did conserve init /media/joey/joey-work/conserve and it created the directory with the binary file.
I then ran this:
joey@warthog:~$ conserve backup /home/joey/ /media/joey/joey-work/conserve/
F1121 12:19:39.968262 17188 bzdatawriter.cc:61] Check failed: bytes_read > 0 : Is a directory [21]
*** Check failure stack trace: ***
@ 0x7fd475d17daa (unknown)
@ 0x7fd475d17ce4 (unknown)
@ 0x7fd475d176e6 (unknown)
@ 0x7fd475d174fb (unknown)
@ 0x7fd475d18477 (unknown)
@ 0x41224e conserve::BzDataWriter::store_file()
@ 0x411aaa conserve::BlockWriter::add_file()
@ 0x407d49 conserve::cmd_backup()
@ 0x4125d0 conserve::run_command_line()
@ 0x406629 conserve::main()
@ 0x7fd475019de5 (unknown)
@ 0x4068ee (unknown)
@ (nil) (unknown)
Aborted (core dumped)
In the backup directory I have a folder called b0000 now. If I call that command again:
joey@warthog:~$ conserve backup /home/joey/ /media/joey/joey-work/conserve/
F1121 12:16:57.847147 17149 util.cc:46] Check failed: fd > 0 : File exists [17]
*** Check failure stack trace: ***
@ 0x7f867546fdaa (unknown)
@ 0x7f867546fce4 (unknown)
@ 0x7f867546f6e6 (unknown)
@ 0x7f867546f4fb (unknown)
@ 0x7f8675470477 (unknown)
@ 0x41544b conserve::write_proto_to_file()
@ 0x40911a conserve::BandWriter::start()
@ 0x406d75 conserve::Archive::start_band()
@ 0x407d06 conserve::cmd_backup()
@ 0x4125d0 conserve::run_command_line()
@ 0x406629 conserve::main()
@ 0x7f8674771de5 (unknown)
@ 0x4068ee (unknown)
@ (nil) (unknown)
Aborted (core dumped)
Joey
This would probably get clippy working and might work better with Cargo dependencies. A sub-crate might break a simple cargo install, though.
Without this, every backup stores everything, which is pretty inefficient.
Perhaps this is the same as having a diff command.
Do block compression on a thread pool, writing files as they complete.
I'm assuming here that compression is the expensive operation we want to parallelize, but depending on the observed balance of time possibly hashing could be parallelized too.
Since the index is sorted we need to accept completed blocks in order by the filename that includes them. So this is complementary to, and related to, storing content from multiple files in a single block. Although we know the hash in advance, we shouldn't write the index until the block data has been written.
Because blocks are of limited size we can keep them entirely in memory, and we can just transfer ownership of the block to the worker thread.
Run ~n_cpus compression worker threads.
One main thread reads files, hashes them, accumulates data into blocks. Push the blocks onto a queue for compression, along with a channel through which the worker can indicate that it's complete.
The compression worker thread returns a Result<()>.
Pseudocode:
pending_files = Deque()
for file in source:
    # Apply backpressure: if too many blocks are queued for compression,
    # wait for the oldest pending block to complete.
    while queue.length > N:
        pending_files[0].channels[0].wait()
    # Files whose blocks have all been written can be added to the index,
    # in the original sorted order.
    while pending_files and pending_files[0].channels.all_complete():
        index.add(pending_files.pop_front())
    # Split this file into blocks and queue each one for compression,
    # remembering a completion channel per block.
    file_completion_channels = []
    for block in pack_blocks_for_file(file):
        block_channel = channel()
        file_completion_channels.append(block_channel)
        queue.push(block.bytes, block_channel)
    pending_files.push_back((file, file_completion_channels))
# After the loop, wait for and index any remaining pending_files.
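A minimal Rust sketch of the worker side of this design, assuming std::sync::mpsc for the queue; BlockJob, spawn_compression_workers, and compress_block are illustrative names, with compress_block standing in for the real compress-and-write step:

use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// A block of file data queued for compression, plus a channel on which the
// worker reports completion back to the main (indexing) thread.
struct BlockJob {
    bytes: Vec<u8>,
    done: mpsc::Sender<std::io::Result<()>>,
}

// Spawn ~n_cpus workers that pull blocks off a shared queue, compress and
// write them, and signal completion. Returns the sending end of the queue.
fn spawn_compression_workers(n_workers: usize) -> mpsc::Sender<BlockJob> {
    let (tx, rx) = mpsc::channel::<BlockJob>();
    let rx = Arc::new(Mutex::new(rx));
    for _ in 0..n_workers {
        let rx = Arc::clone(&rx);
        thread::spawn(move || loop {
            // Take the next queued block; stop when the main thread drops the sender.
            let job = match rx.lock().unwrap().recv() {
                Ok(job) => job,
                Err(_) => break,
            };
            // Stand-in for the real compression + block-file write.
            let result = compress_block(&job.bytes);
            // Completion message back to the main thread.
            let _ = job.done.send(result);
        });
    }
    tx
}

// Placeholder for compressing and writing one block to the archive.
fn compress_block(_bytes: &[u8]) -> std::io::Result<()> {
    Ok(())
}

The main thread keeps the receiving ends of the per-block channels, and adds a file to the index only once all of its blocks have reported Ok(()).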
Do we really need band head and tail files? We assume we can list the directory. Do they store any new information as distinct from just the presence of data blocks?
The tail does let us more quickly tell that the band is complete: this could be in the final data block but that would be slower to read.
Do we need to do anything to detect a concurrent attempt to write the same band or block? Maybe easiest to just detect the written data block already exists.
Start a new file when:
Will require updating read code to handle multiple blocks.
To emit a log message we should erase the progress bar, print the message, then put the progress bar back. (Or maybe lazily put it back.)
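A minimal sketch, assuming the progress bar is a single line on stderr and redraw_progress is a hypothetical callback that repaints it:

// Erase the one-line progress bar, print the log message, then redraw the
// bar (or defer the redraw until the next progress update).
fn emit_log_line(msg: &str, redraw_progress: impl Fn()) {
    // "\r" returns to the start of the line; "\x1b[K" clears to end of line.
    eprint!("\r\x1b[K");
    eprintln!("{}", msg);
    redraw_progress();
}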
conserve validate exists and checks some invariants, but there are others that could usefully be checked.
This is somewhat handled by --exclude on restore, but a specific option to select the things you do want would be more straightforward.
We should easily know at least how many index blocks need to be restored, and from that we can get a rough completion percentage.
Because I really like the deduplication properties of Borg/Attic, I've been thinking about the data storage pool of conserve. Specifically, how the ability to only reference data in the parent backup limits the extent of deduplication possible.
Because I've thought about this for long enough, I'm dumping a mostly-formed idea here rather than endlessly iterating on it in 5-minute increments locally.
I think this preserves Conserve's append-only approach (modulo the file renaming involved) and human-recoverability. I also think it's safely lock-free (in the sense that multiple conserve operations can be safely performed in parallel on the same storage pool without relying on filesystem locking).
Filesystem renames are cheap
Filesystem renames cannot lose file data
Filesystem will handle large numbers of empty files reasonably efficiently
There's a rmdir() or equivalent that will only remove empty directories
Some basic filesystem ordering properties:
If a file a is created before a file b, then a process will never observe a directory listing containing b but not a.
To store a data block, say backup 0012 wants to store a block with SHA256 $HASH:
1. Create pool/$HASH if it does not already exist, and create an empty marker file pool/$HASH/0012.in-progress.
2. List the data-* files in pool/$HASH.
3. If a data-$GEN_NUMBER is found, rename() data-$GEN_NUMBER to data-($GEN_NUMBER + 1).
4. If no data-* file is found, write the block data to a temporary file, then rename it to data-0001.
5. Rename 0012.in-progress to 0012.
To delete backup 0070, Conserve starts with the list of blocks from the backup description. For each block $HASH:
1. List the contents of pool/$HASH.
2. Delete pool/$HASH/0070.
3. Rename or delete the data-* file in pool/$HASH, according to the data-* files present in the listing.
4. rmdir() pool/$HASH, which only removes it if it is now empty.
If there's a problem such as missing or corrupt data, optionally try to continue on the next thing.
ERROR: File exists (os error 17)
stack backtrace:
0: 0x100e351be - backtrace::backtrace::trace::h1b789d8e1542dad0
1: 0x100e358fc - backtrace::capture::Backtrace::new::h3d062d22ca4dc5a6
2: 0x100e34d36 - error_chain::make_backtrace::h48f950ecdd86ad84
3: 0x100e34ded - _$LT$error_chain..State$u20$as$u20$core..default..Default$GT$::default::he14c64e0c2fc5de7
4: 0x100e0a3bc - conserve::band::Band::create::h1d318051cfc1c04f
5: 0x100e07ca4 - conserve::archive::Archive::create_band::h342b7c2956161944
6: 0x100e088ba - conserve::backup::backup::hca5a5659f78774db
7: 0x100d9f81a - conserve::backup::h055dd3b66439ec1d
8: 0x100d9edd7 - conserve::main::h53c12c360c8dc5eb
9: 0x100e8c38a - __rust_maybe_catch_panic
10: 0x100e8b906 - std::rt::lang_start::ha9be7b379cf1665e
conserve delete -b b1234 /backup/home.c6
Unless forced
Might require a two-line progress bar...
Exclusive lock, permission denied etc.
Should keep a count of errors.
Just counting the number of files or maybe their total size will let us show percent completion for the whole backup. If the tree changes in between this preview and the actual backup, progress won't be absolutely accurate but that's fine.
This will take some time, so it should be optional.
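A minimal sketch of such a preview pass, assuming the walkdir crate; since the tree can change before the real backup runs, the counts are only an estimate:

use walkdir::WalkDir;

// Count files and total bytes under `root` so the backup can report
// percent completion. Unreadable entries are simply skipped here.
fn preview_tree(root: &std::path::Path) -> (u64, u64) {
    let mut n_files = 0u64;
    let mut total_bytes = 0u64;
    for entry in WalkDir::new(root).into_iter().filter_map(Result::ok) {
        if entry.file_type().is_file() {
            n_files += 1;
            if let Ok(meta) = entry.metadata() {
                total_bytes += meta.len();
            }
        }
    }
    (n_files, total_bytes)
}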
Nice, but:
A large fraction of data backed up probably will be fairly large and incompressible: jpgs, mpgs, previously-compressed archives, etc.
Trying to compress them will just waste CPU time on insertion and removal.
For files above our minimum granularity goal, combining them with others will only slow retrieval.
Storing them as objects with no compression conceivably makes recovery from a badly broken archive easier. However it also runs some risk that they might be directly seen and edited inside the archive, leading to corruption.
We probably don't want a separate band header for every such loose object, since the band heads will be very small and the headers too choppy.
Therefore perhaps we want multiple data files per band, which might be aggregates or might be single objects. We could reorder the small/compressible files to be inside the aggregate.
Maybe leave this until much later and just use FUSE for remote filesystems?
http://kentonv.github.io/capnproto
pros:
cons:
irrelevant:
Hi Martin,
Comparing Conserve (https://github.com/sourcefrog/conserve/wiki/Compared-to-others) against backintime, which is part of many Linux distros' repos, would be helpful. It's very fast (rsync fast) yet does a lot of things that rsync does not. I'm personally interested in this particular comparison because I use backintime today to do system backups, but I like your approach better.
Joey
Probably look at the extension or filename pattern (e.g. to handle git packs).
Might be simplest to compress them at level 0.
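A rough sketch of such an extension-based heuristic; the list of extensions is illustrative, not a fixed policy:

use std::path::Path;

// Extensions that are almost always already compressed; storing them at
// compression level 0 (or uncompressed) avoids wasted CPU.
const INCOMPRESSIBLE_EXTENSIONS: &[&str] = &[
    "jpg", "jpeg", "png", "gif", "mp3", "mp4", "mpg", "zip", "gz", "bz2", "xz", "7z", "pack",
];

fn probably_incompressible(path: &Path) -> bool {
    path.extension()
        .and_then(|e| e.to_str())
        .map(|e| INCOMPRESSIBLE_EXTENSIONS.contains(&e.to_ascii_lowercase().as_str()))
        .unwrap_or(false)
}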
Conserve currently only writes at about 1MB/s on a fast machine. Maybe Brotli is a bad choice here and we should try gzip.