
zfs_uploader's People

Contributors

alexanderlieret, ddebeau, erisa, jdeluyck, jrothrock, rdelcorro


zfs_uploader's Issues

Create SnapshotDB object

Code like snapshot_name = f'{self._file_system}@{backup_time}' and list_snapshots() should really be part of a larger SnapshotDB object so we can document and reuse code efficiently. The Backup object should also have a snapshot parameter so we can easily restore from a backup.
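
A rough sketch of what such an object might look like (names and methods here are illustrative, not the final API):

import subprocess

class SnapshotDB:
    """Illustrative sketch only: tracks snapshots for one file system."""

    def __init__(self, file_system):
        self._file_system = file_system

    def create_snapshot(self, backup_time):
        # e.g. 'tank/data@20210607_020000'
        name = f'{self._file_system}@{backup_time}'
        subprocess.run(['zfs', 'snapshot', name], check=True)
        return name

    def list_snapshots(self):
        out = subprocess.run(
            ['zfs', 'list', '-H', '-t', 'snapshot', '-o', 'name',
             '-r', self._file_system],
            capture_output=True, text=True, check=True)
        return out.stdout.splitlines()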

backup_info functions should be moved from ZFSjob to their own object

It would improve the readability and organization of the code to move the backup_info functions to their own object. It would also make it easier to support compatibility between old and new backup_info formats.

def _read_backup_info(self):
    info_object = self._s3.Object(self._bucket,
                                  f'{self._filesystem}/backup.info')
    try:
        with BytesIO() as f:
            info_object.download_fileobj(f)
            f.seek(0)
            return json.load(f)
    except ClientError:
        return {}

def _write_backup_info(self, backup_info):
    info_object = self._s3.Object(self._bucket,
                                  f'{self._filesystem}/backup.info')
    with BytesIO() as f:
        f.write(json.dumps(backup_info).encode('utf-8'))
        f.seek(0)
        info_object.upload_fileobj(f)

def _set_backup_info(self, key, file_system, backup_time, backup_type):
    backup_info = self._read_backup_info()
    backup_info[key] = {'file_system': file_system,
                        'backup_time': backup_time,
                        'backup_type': backup_type}
    self._write_backup_info(backup_info)

def _del_backup_info(self, key):
    backup_info = self._read_backup_info()
    backup_info.pop(key)
    self._write_backup_info(backup_info)

Adaptive storageclass

Hi, I'm the author of https://github.com/andaag/zfs-to-glacier/ and I occasionally go hunting for alternatives to see if someone has invested more in this problem than I have, so I can stop maintaining mine.

One thing I see you are missing is choosing the storage class based on size. I have a fairly ugly hack here: https://github.com/andaag/zfs-to-glacier/blob/main/src/main.rs#L95 that switches the storage class to STANDARD when the file is too small for Glacier (in those cases you pay a premium, and Standard storage is actually cheaper).

Glacier's minimum billable object size is 128 KB. Quite a lot of incremental backups can be around 1 KB, so Standard storage is a very clear winner there!
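
For reference, a minimal sketch of the idea in Python (the threshold constant and the fall-back class are assumptions taken from the comment above, not existing zfs_uploader behaviour):

GLACIER_MIN_BILLABLE_BYTES = 128 * 1024  # assumed threshold from the issue text

def pick_storage_class(stream_size_bytes, preferred='DEEP_ARCHIVE'):
    # Fall back to STANDARD when the object is too small for Glacier pricing.
    if stream_size_bytes < GLACIER_MIN_BILLABLE_BYTES:
        return 'STANDARD'
    return preferred

The awkward part is that the size of a zfs send stream isn't known up front, so an estimate (e.g. from a dry-run zfs send -nvP) would be needed before the upload starts.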

Send stream could be sent compressed if the dataset is compressed

If the dataset being backed up has its compression property set to anything other than off, the default behaviour of zfs send is to decompress the data on the fly and send the full uncompressed dataset.

Simply adding the -c (--compressed) flag to zfs send sends the stream compressed instead, which takes up significantly less space on the remote. In my case this reduced a full backup of a PostgreSQL database from 56 GB to 24 GB.

I added this flag to my personal fork in Erisa@c192333 and noticed no regressions or repercussions; however, since users may not have compression enabled on their dataset, or may not want this behaviour to change between versions, I believe the best way forward is to add a zfs_uploader config variable that enables the compressed flag.
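
A hedged sketch of how the flag could be gated behind a config option (the function name and the send_compressed option are illustrative; the real open_snapshot_stream helper may be structured differently):

def build_send_command(file_system, backup_time, send_compressed=False):
    # Illustrative only: build the `zfs send` argument list, optionally
    # adding -c so blocks are sent compressed as they are stored on disk.
    cmd = ['zfs', 'send']
    if send_compressed:
        cmd.append('-c')
    cmd.append(f'{file_system}@{backup_time}')
    return cmd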

CLI entrypoint doesn't work

The CLI entry point is installed as executable-name instead of zfsup, and the command returns this traceback:

executable-name --version
Traceback (most recent call last):
  File "<redacted>/.env/zfs_uploader/bin/executable-name", line 5, in <module>
    from zfs_uploader.__main__ import main
ImportError: cannot import name 'main' from 'zfs_uploader.__main__' (<redacted>/.env/zfs_uploader/lib/python3.7/site-packages/zfs_uploader/__main__.py)

Encrypt snapshots

Hello!

Are there any plans to implement encryption of snapshots from unencrypted pools / datasets before uploading to S3?

Example:

zfs send pool/dataset@snapshot-name | gpg --symmetric --cipher-algo AES256 -o /path/to/encrypted-snapshot.gpg
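
If this were implemented, one approach would be to chain the two processes in Python, roughly like this (illustrative only; the helper name, gpg options, and pipeline wiring are assumptions rather than anything zfs_uploader does today):

import subprocess

def open_encrypted_snapshot_stream(snapshot, passphrase_file):
    # zfs send ... | gpg --symmetric --cipher-algo AES256 ...
    send = subprocess.Popen(['zfs', 'send', snapshot],
                            stdout=subprocess.PIPE)
    gpg = subprocess.Popen(['gpg', '--symmetric', '--cipher-algo', 'AES256',
                            '--batch', '--passphrase-file', passphrase_file,
                            '-o', '-'],
                           stdin=send.stdout, stdout=subprocess.PIPE)
    send.stdout.close()  # let zfs send receive SIGPIPE if gpg exits early
    return gpg.stdout    # encrypted stream, ready to upload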

Update pruning

Are there any plans to add a pruning schedule, similar to borg's prune command with its --keep-hourly, --keep-daily, etc. options?
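
A minimal sketch of the keep-N-per-period idea (illustrative only; borg's actual rules are more involved, and real pruning would also have to keep any full backup that incrementals still depend on):

def prune(backup_times, keep_daily=7):
    # Keep the newest backup for each of the last `keep_daily` distinct days.
    # `backup_times` is an iterable of datetime objects; returns the set to keep.
    keep = {}
    for t in sorted(backup_times, reverse=True):  # newest first
        day = t.date()
        if day not in keep:
            keep[day] = t
        if len(keep) == keep_daily:
            break
    return set(keep.values())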

FreeBSD

Can this be installed/used on FreeBSD?

zfs receive does not work if the destination filesystem has been changed since the most recent snapshot

Error:

cat 20210607_020000.inc | zfs receive <filesystem>@20210607_020000
cannot receive incremental stream: destination <filesystem> has been modified
since most recent snapshot

zfsup restore <filesystem> 20210607_020000 doesn't return anything. If you wrap the restore command in a try/except statement you get a BrokenPipeError. We'll want to catch that and tell the user what the problem likely is. The -F option of zfs receive forces a rollback to the most recent snapshot; we should add an option for that, as sketched below.
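
A sketch of both ideas, catching the broken pipe and exposing a force-rollback option (the function and parameter names are illustrative, not the current code):

import subprocess

def restore_backup(s3_object, snapshot, force_rollback=False):
    # Illustrative only: stream an S3 object into `zfs receive`, optionally
    # with -F so the destination is rolled back to its most recent snapshot.
    cmd = ['zfs', 'receive']
    if force_rollback:
        cmd.append('-F')
    cmd.append(snapshot)
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    body = s3_object.get()['Body']
    try:
        for chunk in iter(lambda: body.read(1024 * 1024), b''):
            proc.stdin.write(chunk)
        proc.stdin.close()
    except BrokenPipeError:
        # zfs receive exited early; surface its stderr instead of a bare traceback.
        stderr = proc.stderr.read().decode('utf-8')
        raise RuntimeError(
            f'zfs receive failed: {stderr.strip()} '
            '(the destination was likely modified since the most recent '
            'snapshot; retry with force_rollback=True, i.e. zfs receive -F).')
    if proc.wait():
        raise RuntimeError(proc.stderr.read().decode('utf-8'))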

Delta backups done against full rather than incrementals

Hi @ddebeau, I am seeing in my AWS console that the incrementals created are based on the full backup rather than on the previous incremental.

[screenshot: AWS console object listing showing the incremental backups growing in size]

If you look at my backups, they are getting bigger and bigger and duplicating a lot of data. I can't create a new full backup since the current one needs to be stored for 6 months due to Deep Archive constraints. I am open to creating a PR to solve this, but I wanted your opinion on the subject first.

This affects:

  • Backups: they need to use the latest incremental as the base rather than the full backup
  • Restores: they need to restore the full backup and the whole chain of incrementals up to the point in time the user specifies

The savings would be massive over time.

Thanks
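
For reference, the chaining boils down to passing the previous backup's snapshot (full or incremental) as the base of zfs send -i rather than always using the full backup's snapshot. A sketch, with an illustrative helper name:

import subprocess

def open_incremental_stream(file_system, base_snapshot_time, backup_time):
    # `zfs send -i <base> <new>`: <base> is the snapshot of the *previous*
    # backup in the chain, not necessarily the full backup's snapshot.
    base = f'{file_system}@{base_snapshot_time}'
    new = f'{file_system}@{backup_time}'
    return subprocess.Popen(['zfs', 'send', '-i', base, new],
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)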

Recursive snapshots

First of all, thank you very much for sharing this project! I'm a newcomer to ZFS and I was able to get started in minutes thanks to the excellent documentation!

As I said before, I'm fairly new to ZFS. I've managed to put together a server with the following datasets:

$ zfs list
NAME                    USED  AVAIL     REFER  MOUNTPOINT
zfspool                 244G  5.09T      467M  /zfspool
zfspool/bulkstorage     145G  5.09T      145G  /zfspool/bulkstorage
zfspool/vm-100-disk-0     3M  5.09T      120K  -
zfspool/vm-100-disk-1  33.0G  5.12T     3.30G  -
zfspool/vm-102-disk-0  66.0G  5.13T     20.6G  -
zfspool/vm-102-disk-1     3M  5.09T      120K  -

My config file looks like this (with redacted info styled <like this>):

[DEFAULT]
bucket_name = <bucket>
region = <region>
access_key = <access key>
secret_key = <secret key>
storage_class = STANDARD
endpoint = <endpoint>

[zfspool]
cron = 0 2 * * *
max_snapshots = 7
max_incremental_backups_per_full = 6
max_backups = 7

I naively expected zfs_uploader to recursively snapshot and upload backups of each dataset within zfspool; however, it only did so for the data stored directly in the zfspool mount point, not for any of the child datasets.

Is this supported by zfs_uploader, or do I need to specify each dataset manually, e.g. [zfspool/bulkstorage], [zfspool/vm-100-disk-0], etc.?

Also related, I did find that ZFS supports recursive snapshots, but I haven't tried it yet: https://docs.oracle.com/cd/E19253-01/819-5461/gdfdt/index.html
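
On the ZFS side, recursive snapshots are a single command, so at minimum the snapshot step could be wrapped like the sketch below (illustrative only; zfs_uploader may or may not grow such an option). Uploading the child datasets would presumably also need either per-dataset sends or a replication stream (zfs send -R), which is a bigger change.

import subprocess

def snapshot(file_system, backup_time, recursive=False):
    # `zfs snapshot -r pool@name` snapshots the dataset and all descendants.
    cmd = ['zfs', 'snapshot']
    if recursive:
        cmd.append('-r')
    cmd.append(f'{file_system}@{backup_time}')
    subprocess.run(cmd, check=True)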

Part number limit is reached when uploading large snapshots

The following traceback occurs when uploading large snapshots:

botocore.exceptions.ClientError: An error occurred (InvalidArgument) when calling the UploadPart operation: Part number must be an integer between 1 and 10000, inclusive

The error occurs when the part number limit (10,000) is reached. We'll need to set the part size ourselves instead of letting Boto pick it.

https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html
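
A minimal sketch of sizing the parts ourselves, assuming an estimated stream size is available (the 10,000-part and 5 MiB minimums below are S3's documented multipart limits):

import math
from boto3.s3.transfer import TransferConfig

S3_MAX_PARTS = 10_000
S3_MIN_PART_SIZE = 5 * 1024 * 1024  # 5 MiB minimum part size

def transfer_config_for(estimated_size_bytes):
    # Pick a chunk size large enough to stay under the part-number limit,
    # but never below S3's minimum part size.
    chunk = max(S3_MIN_PART_SIZE,
                math.ceil(estimated_size_bytes / S3_MAX_PARTS))
    return TransferConfig(multipart_chunksize=chunk)

Since the exact size of a zfs send stream isn't known up front, the estimate could come from a dry run (e.g. zfs send -nvP) or be padded generously.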

Add GitHub Actions testing

We should be able to run the test suite on the standard GitHub Actions Linux runner, since zpool can create pools backed by plain files.
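
A rough fixture sketch for the file-backed pool idea (names and sizes are illustrative; the runner needs ZFS installed and root privileges):

import subprocess
import tempfile

def create_test_pool(name='zfs_uploader_test', size_mb=128):
    # zpool accepts a plain file as a vdev, so no block device is needed.
    backing = tempfile.NamedTemporaryFile(delete=False, suffix='.img')
    backing.truncate(size_mb * 1024 * 1024)  # sparse backing file
    backing.close()
    subprocess.run(['zpool', 'create', name, backing.name], check=True)
    return name, backing.name

def destroy_test_pool(name):
    subprocess.run(['zpool', 'destroy', name], check=True)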

Add basic CLI

A basic CLI should be added with the following functions (a minimal sketch follows the list):

  • Display version
  • Display help
  • List backups
  • Backup
  • Restore
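
A minimal argparse sketch covering those commands (illustrative only; the real CLI may use a different framework, command names, and flags):

import argparse

def main():
    parser = argparse.ArgumentParser(prog='zfsup',
                                     description='Upload ZFS snapshots to S3.')
    parser.add_argument('--version', action='version', version='0.0.0')  # placeholder
    sub = parser.add_subparsers(dest='command', required=True)
    sub.add_parser('list', help='List backups')
    sub.add_parser('backup', help='Run backup jobs')
    restore = sub.add_parser('restore', help='Restore a backup')
    restore.add_argument('filesystem')
    restore.add_argument('backup_time')
    args = parser.parse_args()
    # dispatch on args.command here ...

if __name__ == '__main__':
    main()

Help output comes for free from argparse, which covers the "Display help" item.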

pip install fails due to bad script entry point

ERROR: For req: zfs_uploader. Invalid script entry point: <ExportEntry zfsup = zfs_uploader.__main__:None []> - A callable suffix is required. Cf https://packaging.python.org/specifications/entry-points/#use-for-scripts for more information.
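
The fix would be to point the console script at an actual callable, assuming main exists in zfs_uploader.__main__ (which matches the traceback in the CLI entrypoint issue above), e.g. in setup.py:

# setup.py (excerpt)
from setuptools import setup

setup(
    entry_points={
        'console_scripts': [
            'zfsup = zfs_uploader.__main__:main',
        ],
    },
)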

Move S3 upload code to BackupDB

We should move the S3 upload code to BackupDB so that, like SnapshotDB, it can create the objects it references. The job code should only handle the more complicated logic that involves both backups and snapshots.

s3_key = f'{self._file_system}/{backup_time}.full'
self._logger.info(f'[{s3_key}] Starting full backup.')

with open_snapshot_stream(self.filesystem, backup_time, 'r') as f:
    self._bucket.upload_fileobj(f.stdout,
                                s3_key,
                                Config=self._s3_transfer_config,
                                ExtraArgs={
                                    'StorageClass': self._storage_class
                                })
    stderr = f.stderr.read().decode('utf-8')

if f.returncode:
    raise ZFSError(stderr)

self._check_backup(s3_key)
self._backup_db.create_backup(backup_time, 'full', s3_key)

Configuring different max part numbers for some S3 providers

When setting up zfs_uploader against Scaleway Object Storage (specifically their GLACIER tier), everything worked as expected except for one caveat: the maximum part number on Scaleway is 1,000 rather than the 10,000 used by AWS.

This resulted in an error when uploading with the default setup, since it calculated the part sizes based on 10,000 parts and eventually failed due to exceeding Scaleway's limit of 1,000 parts.

I resolved this for my use case by simply modifying a number in job.py (Erisa@20ed42f); however, I feel that going forward it would be a good idea to make this value configurable in the zfs_uploader configuration file and to document it in the README.

You could also detect and change the value based on predefined provider limits; even then, it would still be nice to have it in a user-configurable place.
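
For example, a hypothetical key in the existing config format (max_multipart_parts is an illustrative name, not a current option):

[DEFAULT]
...
max_multipart_parts = 1000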

Restore full backup only when necessary

When restoring an incremental backup, we shouldn't restore the full backup if the snapshot used for the full backup still exists on the system.

elif backup_type == 'inc':
    # restore full backup first
    backup_full = self._backup_db.get_backup(backup.dependency)
    self._restore_snapshot(backup_full)

    self._restore_snapshot(backup)
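
A sketch of the check (assuming a SnapshotDB as proposed in the other issue, and a snapshot_name attribute on the backup record; both are assumptions, not current code):

elif backup_type == 'inc':
    backup_full = self._backup_db.get_backup(backup.dependency)
    # Skip re-downloading the full backup if its snapshot still exists locally.
    local_snapshots = self._snapshot_db.list_snapshots()
    if backup_full.snapshot_name not in local_snapshots:
        self._restore_snapshot(backup_full)
    self._restore_snapshot(backup)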
