numblr / glaciertools

Command line (bash) scripts to upload large files to AWS glacier using multipart upload and to calculate the required tree hash

License: MIT License

Shell 84.44% Python 12.44% Dockerfile 3.12%
aws-glacier amazon-glacier shell-scripts bash-script command-line-tool treehash merkel-tree glacier hash-functions command-line

glaciertools's Introduction

Command line tools (Bash scripts) to upload large files to AWS Glacier

An archive containing only the scripts can be downloaded from the releases page. Some of the scripts depend on others and assume that they are in the same directory.

Commands

glacierupload
glacierabort
treehash

glacierupload

The script orchestrates the multipart upload of a large file to AWS Glacier.

Prerequisites

This script depends on openssl and parallel. If you are on Mac OS X and using Homebrew, then run the following:

brew install parallel
brew install openssl

The script assumes that you have an AWS account, have signed up for the Glacier service, and have already created a vault.

It also assumes that you have the AWS Command Line Interface installed on your machine, e.g. by:

pip install awscli

The script also requires that the aws cli is configured with your AWS credentials. Optionally, it supports profiles set up in the aws cli:

aws --profile myprofile configure

You can verify that your connection works by describing the vault you have created:

aws --profile myprofile glacier describe-vault --vault-name myvault --account-id -
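
If the vault does not exist yet, it can be created in the same way:

aws --profile myprofile glacier create-vault --vault-name myvault --account-id -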

Script Usage

glacierupload [-p|--profile <profile>] [-d|--description <description>] [-s|--split-size <level>]
               <-v|--vault vault> <file...>

-v --vault        name of the vault to which the file should be uploaded  
-p --profile      optional profile name to use for the upload. The profile
                  name must be configured with the aws cli client.
-d --description  optional description of the file
-s --split-size   level that determines the size of the parts used for
                  uploading the file. The level can be a number between
                  0 and 12 and results in a part size of 2^level MBytes.
                  If not specified the default is 0, i.e. the file is
                  uploaded in 1MByte parts.
-h --help         print help message

The script prints information about the upload to the shell and additionally stores it in a file in the directory where the script is executed. The file name is the original file name postfixed with the first 8 characters of the archive id and '.upload.json'.

The script splits the file on the fly and only temporarily stores the parts that are currently being uploaded on disk, i.e. the required free disk space is small and depends on the chunk size and the number of parallel uploads. The size of the individual chunks can be controlled with the --split-size option. The number of parallel uploads is determined by parallel based on the number of available CPUs.
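
For illustration, the following sketch (not the script's exact invocation) shows how GNU parallel can stream a file in exact 1MByte blocks to parallel jobs without creating temporary files; each job receives one part on stdin, which here is simply hashed instead of uploaded:

# sketch: --recend '' makes --pipepart cut the file into blocks of exactly
# --block bytes; each job gets one block on stdin
parallel --pipepart -a /path/to/my/archive --block 1M --recend '' \
    'openssl dgst -sha256'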

Be aware of the constraints on the number and size of the parts in the AWS Glacier specifications: a part must be a power of two between 1 MByte and 4 GBytes in size, and an upload may consist of at most 10,000 parts!
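
For example, a tiny helper along these lines (hypothetical, not part of the scripts) picks the smallest level that keeps an upload under the 10,000-part limit:

file_size=$(wc -c < /path/to/my/archive)
level=0
# grow the part size (2^level MBytes) until the file fits into 10,000 parts
while (( level < 12 )); do
    part_size=$(( 2**level * 1048576 ))
    (( (file_size + part_size - 1) / part_size <= 10000 )) && break
    level=$(( level + 1 ))
done
echo "use --split-size $level"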

In case the upload of a part fails, the script performs a number of retries. If the upload of a part ultimately fails after the maximum number of retries, the script aborts the upload and terminates.

Examples

To simply upload /path/to/my/archive to myvault use

> ./glacierupload -v myvault /path/to/my/archive

This will upload the archive in 1MByte chunks using the standard credentials that are configured for the aws cli.

The following command

> ./glacierupload -p my_aws_cli_profile -v myvault -s 5 -d "My favorite archive" /path/to/my/archive

will upload /path/to/my/archive to myvault on AWS Glacier with a short description. The credentials configured in the my_aws_cli_profile profile of the aws cli will be used. Instead of the default part size of 1MB, the archive is uploaded in 2^5 = 32MByte chunks.

glacierabort

Abort (close) all unfinished uploads to a vault on AWS Glacier.

Script Usage

glacierabort -v|--vault <vault> [-p|--profile <profile>]

-v --vault        name of the vault for which uploads should be aborted  
-p --profile      optional profile name to use. The profile name must be
                  configured with the aws cli client.
-h --help         print help message

Examples

To abort all currently unfinished uploads run

> ./glacierabort -v myvault
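
Under the hood this corresponds roughly to listing the pending multipart uploads and aborting each of them with the aws cli (a sketch; the script's actual implementation may differ):

aws glacier list-multipart-uploads --account-id - --vault-name myvault \
    | jq -r '.UploadsList[].MultipartUploadId' \
    | while read -r upload_id; do
        aws glacier abort-multipart-upload --account-id - \
            --vault-name myvault --upload-id "$upload_id"
      done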

treehash

The script calculates the top-level hash of a Merkle tree (tree hash) built from equal-sized chunks of a file.

If possible, i.e. if multiple CPUs are available on your system, the script parallelizes the computation of the tree hash.

The script does not depend on any of the other scripts in this repository and can be used stand-alone.

Prerequisites

This script depends on parallel and openssl. If you are on Mac OS X and are using Homebrew, then run the following:

brew install openssl
brew install parallel

Script Usage

treehash [-b|--block <size>] [-a|--alg <alg>] [-v|--verbose <level>] <file>


-b --block       size of the leaf data blocks in bytes, defaults to 1M.
                 Can be postfixed with K, M, G, T, P, E, k, m, g, t, p, or e;
                 see the '--block' option of the 'parallel' command for details.
-a --alg         hash algorithm to use, defaults to 'sha256'. Supported
                 algorithms are the ones supported by 'openssl dgst'
-v --verbose     print diagnostic messages to stderr if level is larger than 0:
                  * level 1: Print the entire tree
                  * level 2: Print debug information
-h --help        print help message

The script does not create any temporary files nor does it require that the chunks of the file are present as files on the disk.
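
For reference, the tree-hash construction itself is simple: hash each fixed-size block, then repeatedly hash the concatenation of adjacent pairs of binary digests until a single hash remains. A minimal sequential sketch follows (the real script is parallelized; the file path and variable names are illustrative):

file=/path/to/my/archive
block=1048576
size=$(wc -c < "$file")

# leaf level: SHA-256 of each 1MByte block, as hex strings
hashes=()
blocks=$(( (size + block - 1) / block ))
for (( i = 0; i < blocks; i++ )); do
    h=$(dd if="$file" bs="$block" skip="$i" count=1 2>/dev/null \
          | openssl dgst -sha256 -r | cut -d' ' -f1)
    hashes+=("$h")
done

# combine adjacent pairs of *binary* digests; a leftover odd hash is
# promoted to the next level unchanged
while (( ${#hashes[@]} > 1 )); do
    next=()
    for (( i = 0; i < ${#hashes[@]}; i += 2 )); do
        if (( i + 1 < ${#hashes[@]} )); then
            next+=("$(printf '%s%s' "${hashes[i]}" "${hashes[i+1]}" \
                        | xxd -r -p | openssl dgst -sha256 -r | cut -d' ' -f1)")
        else
            next+=("${hashes[i]}")
        fi
    done
    hashes=("${next[@]}")
done
echo "${hashes[0]}"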

Examples

To calculate the tree hash of /path/to/my/archive with a chunk size of 1MB and the sha-256 hash algorithm use

> ./treehash /path/to/my/archive
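
The leaf block size and hash algorithm can be changed with the options described above, e.g. to use 8MByte blocks and SHA-512:

> ./treehash -b 8M -a sha512 /path/to/my/archive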

References

  • O. Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011:42-47.

glaciertools's People

Contributors

aj-enns, jtai, numblr, o-darek, spartantri


glaciertools's Issues

Won't upload all files in directory

I have a folder with sub folders and files in the root as well as in each sub folder.
For instance, I have 9 files in /folder/subfolder/; all of them are 1MB to 68GB big. For some reason, when I try to upload them to Glacier, only one file is uploaded, always the same one, which is 761MB big. I tried to adjust the chunk size, but nothing changed. The command I am using is:
./glacierupload -p default -v vaultname -s 8 /folder/subfolder/*

Remove dependency on jq

We only need jq to parse the upload and archive ids; that should be rather straightforward with a regex as well, and removing it gives us one dependency less.
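
For example, something along these lines could replace the jq call (a sketch; the response variable is illustrative):

# extract the upload id from the aws cli JSON response without jq
upload_id=$(printf '%s' "$response" | sed -n 's/.*"uploadId": *"\([^"]*\)".*/\1/p')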

Max level 12

I noticed that the max part size parameter can be 12, not 22. Otherwise:
"An error occurred (InvalidParameterValueException) when calling the InitiateMultipartUpload operation: Invalid part size: 17179869184. Part size must not be null, must be a power of two and be between 1048576 and 4294967296 bytes."
I think the problem is here:
Line 16: MB=1048576
...
Line 123: local -r part_size=$(( 2**split_size * MB ))

so 1048576 * 2^22 = 4,398,046,511,104, which is way above the limit (4,294,967,296).
I may correct that.

Retry and stop on failing parts

If the upload of a part fails, the glacierupload script still continues to upload the remaining parts and fails only when completing the upload.

Thoughts:
Try checking the status of the cli command in the upload_part function

aws "${upload_args[@]}"

success=$?
if (( success > 0 )); then
    # echo "Upload of range $range failed"
    exit $success
fi

and check the --retries 3 --halt now,fail=1 options for the parallel command (this doesn't work on the first try though).
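
As a quick, self-contained demonstration of those flags (illustrative only): the failing job is retried 3 times, after which all jobs are stopped and parallel exits non-zero:

parallel --retries 3 --halt now,fail=1 'echo job {}; test {} -ne 2' ::: 1 2 3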

Question rather than issue

What will happen if I run your script with the option to split into 4GB chunks against a folder containing large files >4GB and files <4GB? Will it fail on the 'small' files and continue splitting the large ones? Let's say that I have the following folder structure:

/data/files/x/
/data/files/y/
/data/files/

All of them contain files between 300MB and 100GB. Ideally, I'd like to run the script in a tmux/screen session and check on it once a day.

Support input from stdin to glacierupload

It would be really nice if glacierupload accepted stdin as input; then the archive would not need to be present on disk but could be created on the fly, e.g. by tar:
tar -cf - mydir | glacierupload -v myvault

This should be easy, as stdin can be passed to parallel's -a option by using parallel -a - (see the tutorial).
