Comments (4)
Thanks for thinking about this. I think Glacier is important, but just haven't had time to work on it. The .bs files are caches of data about differences between snapshots, which are used to decide what to send (e.g. which snapshots should be pulled out of glacier). They can (mostly) be recreated, so are helpful but not required. Since the .bs files are small, I'd lean towards something like #3 -- or maybe reorganize them so they are easier to carve out with a transition rule.
from buttersink.
Something's wrong, beyond the .bs files.
As a temporary solution before putting my hands in the buttersink code, I wrote this:
#!/usr/bin/env python
"""Tag all buttersink snapshot files with the ToGlacier: Y tag, so that
they can be easily picked up by a transition rule. Exclude files in trash
and .bs manifests.
"""
import boto3

BUCKET = 'crusaderky-buttersink'

client = boto3.client('s3')
s3 = boto3.resource('s3')
bucket = s3.Bucket(BUCKET)

for obj in bucket.objects.all():
    # .bs manifests and trashed diffs must stay in Standard storage
    if obj.key.endswith('.bs') or obj.key.startswith('trash/'):
        print("Skipped: " + obj.key)
    else:
        print("Tagged: " + obj.key)
        # Note: this replaces any tags already set on the object
        client.put_object_tagging(
            Bucket=BUCKET,
            Key=obj.key,
            Tagging={'TagSet': [{'Key': 'ToGlacier', 'Value': 'Y'}]})
Then I implemented a transition rule that moves everything with tag ToGlacier=Y to Glacier after 1 day.
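For reference, the same transition rule can also be created from Python rather than the console. This is only a minimal sketch, assuming the bucket and tag used by the script above; the rule ID and the helper names build_rule and apply_rule are made up for illustration.

```python
def build_rule(tag_key='ToGlacier', tag_value='Y', days=1):
    """Return a lifecycle configuration that moves tagged objects to Glacier."""
    return {
        'Rules': [{
            'ID': 'tagged-to-glacier',  # hypothetical rule name
            'Filter': {'Tag': {'Key': tag_key, 'Value': tag_value}},
            'Status': 'Enabled',
            'Transitions': [{'Days': days, 'StorageClass': 'GLACIER'}],
        }]
    }


def apply_rule(bucket, config):
    """Install the configuration on the bucket."""
    import boto3  # imported here so build_rule() works without boto3 installed
    boto3.client('s3').put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config)
```

It would be called as apply_rule('crusaderky-buttersink', build_rule()). Be aware that put_bucket_lifecycle_configuration replaces the bucket's whole lifecycle configuration, so any other rules have to be included in the same call.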
I run buttersink:
buttersink --part-size 500 /btrfs/ezgi/ s3://crusaderky-buttersink/ezgi/
followed by my script, and 2 days later everything looks good:
# buttersink /btrfs/ezgi/
988b64fb-4ebe-3b4f-a3e1-b1b7501e17a1 /btrfs/ezgi/20170320-000001 (611.4 GiB 99.21 MiB exclusive)
2842be75-49bb-1345-ab3c-4146f2cc5ef3 /btrfs/ezgi/20170930-000002 (653.5 GiB 83.48 MiB exclusive)
a36ace5b-6c40-c94a-8234-ce5ad52244b8 /btrfs/ezgi/20171218-223900 (653.4 GiB 16 KiB exclusive)
# head /btrfs/ezgi/*.bs
==> /btrfs/ezgi/20170930-000002.bs <==
2842be75-49bb-1345-ab3c-4146f2cc5ef3 988b64fb-4ebe-3b4f-a3e1-b1b7501e17a1 47178989544
==> /btrfs/ezgi/20171218-223900.bs <==
a36ace5b-6c40-c94a-8234-ce5ad52244b8 2842be75-49bb-1345-ab3c-4146f2cc5ef3 36083633
# buttersink s3://crusaderky-buttersink/ezgi/
Listing S3 Bucket "crusaderky-buttersink" contents...
988b...17a1 /ezgi/20170320-000001 from None (701.9 GiB)
2842...5ef3 /ezgi/20170930-000002 from 988b...17a1 /ezgi/20170320-000001 (43.94 GiB)
a36a...44b8 /ezgi/20171218-223900 from 2842...5ef3 /ezgi/20170930-000002 (34.41 MiB)
TOTAL: 3 diffs 745.9 GiB
Files on s3:
ezgi/20170320-000001/988b64fb-4ebe-3b4f-a3e1-b1b7501e17a1_None (701.9GB, Glacier)
ezgi/20170930-000002/2842be75-49bb-1345-ab3c-4146f2cc5ef3_988b64fb-4ebe-3b4f-a3e1-b1b7501e17a1 (43.9GB, Standard, tagged - will be moved to Glacier tomorrow)
ezgi/20171218-223900/a36ace5b-6c40-c94a-8234-ce5ad52244b8_2842be75-49bb-1345-ab3c-4146f2cc5ef3 (34.4MB, Standard, tagged - will be moved to Glacier tomorrow)
ezgi/20170930-000002.bs (Standard, no tag)
ezgi/20171218-223900.bs (Standard, no tag)
However, when I now run buttersink again (same command as above), it should in theory simply recognize that everything is already aligned and exit. Instead it malfunctions and tries to re-send the last snapshot, with a changed (and suboptimal) parent:
# buttersink -n /btrfs/ezgi/ s3://crusaderky-buttersink/ezgi/
Listing S3 Bucket "crusaderky-buttersink" contents...
Measuring a36a...44b8 /btrfs/ezgi/20171218-223900 from 988b...17a1 /btrfs/ezgi/20170320-000001 (~44.9 GiB)
^C:00:35.378090: Sent 1.406 GiB of 44.9 GiB (3%) ETA: 0:18:14.194913 (341 Mbps )
# buttersink -e -n /btrfs/ezgi/ s3://crusaderky-buttersink/ezgi/
Listing S3 Bucket "crusaderky-buttersink" contents...
Optimal synchronization:
45.74 GiB from 2 diffs in btrfs /btrfs/ezgi
701.9 GiB from 1 diffs in S3 Bucket "crusaderky-buttersink"
747.6 GiB from 3 diffs in TOTAL
Keep: 988b...17a1 /ezgi/20170320-000001 from None (701.9 GiB)
WOULD: Xfer: a36a...44b8 /btrfs/ezgi/20171218-223900 from 988b...17a1 /btrfs/ezgi/20170320-000001 (~44.9 GiB)
WOULD: Xfer: 2842...5ef3 /btrfs/ezgi/20170930-000002 from a36a...44b8 /btrfs/ezgi/20171218-223900 (~856.2 MiB)
I think this has to do with 20170320-000001 being in Glacier, but I am not 100% sure anymore...
from buttersink.
Hmmmm... thanks for trying this. It seems to recognize and keep the 20170320 in Glacier, so at least some Glacier diffs are being recognized. In the last run, the estimates might be causing it to assume (incorrectly) that it can increase the storage efficiency by transferring additional diffs. It really should be using the actual diff size in Glacier. I'm curious -- if you let the measuring xfer continue (buttersink without the -e flag), would it come up with a more appropriate plan?
from buttersink.
Adding the -d flag gives me some new insight into what's going on:
buttersink -d -e -n /btrfs/ezgi/ s3://crusaderky-buttersink/ezgi/
Listing S3 Bucket "crusaderky-buttersink" contents...
Optimal synchronization:
653.6 GiB from 3 diffs in btrfs /btrfs/ezgi
653.6 GiB from 3 diffs in TOTAL
WOULD: Xfer: a36a...44b8 /btrfs/ezgi/20171218-223900 from None (653.4 GiB)
WOULD: Xfer: 988b...17a1 /btrfs/ezgi/20170320-000001 from a36a...44b8 /btrfs/ezgi/20171218-223900 (147 MiB)
WOULD: Xfer: 2842...5ef3 /btrfs/ezgi/20170930-000002 from a36a...44b8 /btrfs/ezgi/20171218-223900 (91.46 MiB)
WOULD: Trash: 988b...17a1 /ezgi/20170320-000001 from None (701.9 GiB)
WOULD: Trash: a36a...44b8 /ezgi/20171218-223900 from 2842...5ef3 /ezgi/20170930-000002 (34.41 MiB)
WOULD: Trash: 2842...5ef3 /ezgi/20170930-000002 from 988b...17a1 /ezgi/20170320-000001 (43.94 GiB)
Trashed 3 diffs (745.9 GiB)
And as you predicted, removing -e changes things a lot:
# buttersink -n /btrfs/ezgi/ s3://crusaderky-buttersink/ezgi/
Listing S3 Bucket "crusaderky-buttersink" contents...
measured size (43.97 GiB), estimated size (43.97 GiB)
Optimal synchronization:
43.97 GiB from 2 diffs in btrfs /btrfs/ezgi
701.9 GiB from 1 diffs in S3 Bucket "crusaderky-buttersink"
745.9 GiB from 3 diffs in TOTAL
Keep: 988b...17a1 /ezgi/20170320-000001 from None (701.9 GiB)
WOULD: Xfer: a36a...44b8 /btrfs/ezgi/20171218-223900 from 988b...17a1 /btrfs/ezgi/20170320-000001 (43.88 GiB)
WOULD: Xfer: 2842...5ef3 /btrfs/ezgi/20170930-000002 from a36a...44b8 /btrfs/ezgi/20171218-223900 (91.46 MiB)
My .bs files after the latest command:
==> 20170320-000001.bs <==
988b64fb-4ebe-3b4f-a3e1-b1b7501e17a1 a36ace5b-6c40-c94a-8234-ce5ad52244b8 154108293
==> 20170930-000002.bs <==
2842be75-49bb-1345-ab3c-4146f2cc5ef3 988b64fb-4ebe-3b4f-a3e1-b1b7501e17a1 47178989544
2842be75-49bb-1345-ab3c-4146f2cc5ef3 a36ace5b-6c40-c94a-8234-ce5ad52244b8 95900770
==> 20171218-223900.bs <==
a36ace5b-6c40-c94a-8234-ce5ad52244b8 2842be75-49bb-1345-ab3c-4146f2cc5ef3 36083633
a36ace5b-6c40-c94a-8234-ce5ad52244b8 988b64fb-4ebe-3b4f-a3e1-b1b7501e17a1 47119195763
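To make the numbers above easier to compare with the buttersink listings, here is a small sketch that parses .bs lines and prints the sizes in GiB. The column meaning (snapshot UUID, parent UUID, size in bytes) is my inference from the sizes matching the listings, not something taken from the buttersink code; Diff, parse_bs and gib are names made up for this example.

```python
from collections import namedtuple

# Assumed layout of one .bs line: <snapshot uuid> <parent uuid> <size in bytes>
Diff = namedtuple('Diff', 'to_uuid from_uuid size')


def parse_bs(text):
    """Parse the contents of a .bs file into a list of Diff tuples."""
    diffs = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        diffs.append(Diff(fields[0], fields[1], int(fields[2])))
    return diffs


def gib(size):
    """Convert a size in bytes to GiB."""
    return size / 2.0 ** 30


sample = """\
a36ace5b-6c40-c94a-8234-ce5ad52244b8 2842be75-49bb-1345-ab3c-4146f2cc5ef3 36083633
a36ace5b-6c40-c94a-8234-ce5ad52244b8 988b64fb-4ebe-3b4f-a3e1-b1b7501e17a1 47119195763
"""
for d in parse_bs(sample):
    print("%s from %s: %.2f GiB" % (d.to_uuid[:4], d.from_uuid[:4], gib(d.size)))
```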
So there are two problems here:
- -e is inaccurate (ok), but that can lead buttersink to revert previous decisions (not ok). The fix would be to have the .bs files also record the size from None.
- On the initial upload, buttersink (always without the -e flag!) measured 20170930 vs. 20170320 and 20171218 vs. 20170930, but not the other way around. On the second upload attempt, buttersink measured it the other way around. Even if the latter is a few kilobytes larger, for some reason buttersink is now trying to re-upload the whole thing (argh). This is bad because:
  - re-uploading 50GB takes a long time
  - if you delete anything after keeping it in Glacier for less than 3 months, you pay for the full 3 months anyway, which can add up to a huge bill
  - there's no guarantee that buttersink won't do the same trick again in the future, possibly reverting to the initial situation, maybe triggered by a new snapshot or a minor code change
This problem looks completely unrelated to Glacier. What we need is for buttersink to re-use already existing snapshots on S3 whenever possible, even if its logic (which is agnostic of existing snapshots) would tell it to do otherwise for any reason.
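To illustrate the behaviour I have in mind, here is a rough sketch of a planner that prefers diffs already stored on the sink. It is not buttersink's algorithm: choose_parent and both dictionaries are hypothetical, and it ignores whether the chosen parent is itself available on the sink.

```python
def choose_parent(measured, stored):
    """Pick a parent for one snapshot.

    measured: {parent_uuid: size in bytes} for diffs the source could send
    stored:   {parent_uuid: size in bytes} for diffs already on the sink
    Returns (parent_uuid, action) where action is 'keep' or 'xfer'.
    """
    if stored:
        # Anything already on the sink wins, even if a smaller diff exists
        return min(stored, key=stored.get), 'keep'
    return min(measured, key=measured.get), 'xfer'


# Sizes for 20170930-000002, taken from the .bs file above
measured = {
    '988b64fb-4ebe-3b4f-a3e1-b1b7501e17a1': 47178989544,
    'a36ace5b-6c40-c94a-8234-ce5ad52244b8': 95900770,
}
# The diff from 988b... is the one already uploaded to S3
stored = {'988b64fb-4ebe-3b4f-a3e1-b1b7501e17a1': 47178989544}

print(choose_parent(measured, stored))
print(choose_parent(measured, {}))
```

With the stored diff taken into account, the snapshot is kept as is; only when nothing is stored does the smaller diff from a36a... get picked for transfer.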
Finally, I tried adding a digit to the second line of 20171218-223900.bs, in an attempt to make buttersink choose the 47GB option instead of the 470GB one. It doesn't:
buttersink -n /btrfs/ezgi/ s3://crusaderky-buttersink/ezgi/
Listing S3 Bucket "crusaderky-buttersink" contents...
measured size (438.9 GiB), estimated size (438.9 GiB)
Optimal synchronization:
438.9 GiB from 2 diffs in btrfs /btrfs/ezgi
701.9 GiB from 1 diffs in S3 Bucket "crusaderky-buttersink"
1.114 TiB from 3 diffs in TOTAL
Keep: 988b...17a1 /ezgi/20170320-000001 from None (701.9 GiB)
WOULD: Xfer: a36a...44b8 /btrfs/ezgi/20171218-223900 from 988b...17a1 /btrfs/ezgi/20170320-000001 (438.8 GiB)
WOULD: Xfer: 2842...5ef3 /btrfs/ezgi/20170930-000002 from a36a...44b8 /btrfs/ezgi/20171218-223900 (91.46 MiB)
from buttersink.