
zstash's Introduction

zstash

zstash is an HPSS long-term archiving tool for E3SM.


See the documentation for more details.

License

Copyright (c) 2018-2021, Energy Exascale Earth System Model Project. All rights reserved.

SPDX-License-Identifier: (BSD-3-Clause)

See LICENSE for details

Unlimited Open Source - BSD 3-clause Distribution LLNL-CODE-819717

zstash's People

Contributors

chengzhuzhang, forsyth2, golaz, lukaszlacinski, mahf708, tomvothecoder, xylar, zshaheen


zstash's Issues

Add option to work without HPSS

E3SM dedicated machines (anvil, compy) do not have HPSS storage attached to them. It is currently up to individual users to decide on the best way to free disk space when needed. This is inefficient and could lead to unfortunate data loss.

One remedy would be to implement an option for zstash to work on systems without HPSS. This could take two forms:

  1. Only create local tar archives in the zstash subdirectory. The user could then transfer tar and index.db files for offline storage.
  2. Implement a remote HPSS option using globus.

The command line usage might look something like:

zstash create --hpss=none ... for local only

zstash create --hpss=globus:nersc-hpss/... ... for remote HPSS using globus.

In terms of implementation, this should be reasonably straightforward. The bulk of the changes would consist of generalizing

https://github.com/E3SM-Project/zstash/blob/master/zstash/hpss.py

to handle different backends.
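One possible shape for that generalization, as a minimal sketch (hpss_transfer, globus_transfer, and hsi_transfer are hypothetical names, not the current API):

    # Sketch only: dispatch on the --hpss value to pick a backend.
    def hpss_transfer(hpss, file_path, mode):
        if hpss == "none":
            # Local-only mode: tars simply stay in the zstash/ cache.
            return
        elif hpss.startswith("globus:"):
            # Remote HPSS via globus, e.g. globus:nersc-hpss/...
            globus_transfer(hpss[len("globus:"):], file_path, mode)  # assumed helper
        else:
            # Default: the existing hsi-based put/get.
            hsi_transfer(hpss, file_path, mode)  # assumed wrapper around hsi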

How to determine the version of zstash being used?

Is there a way to determine the version of zstash being used?

This can be particularly useful for (i) knowing that the version being used is the latest, and (ii) reporting errors.

I could not find this in the documentation, and guessing at arguments, such as --version, didn't work for me.
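For what it's worth, argparse can provide this almost for free; a minimal sketch, assuming the package exports a __version__ string:

    import argparse
    from zstash import __version__  # assumption: the package defines __version__

    parser = argparse.ArgumentParser(prog="zstash")
    parser.add_argument(
        "--version", action="version", version="zstash {}".format(__version__)
    )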

Add '--keep' option for update

  • Add '--keep' command line option for update.
  • If command line option is not specified, use preference stored in the database (this is what should be happening now, but may not be working properly).

Remove keep from config

Remove keep from the config. Tar files would then no longer be kept or removed based on the keep setting from an earlier command. (Although that does not appear to be the current behavior anyway.)

zstash check deleting local tar file

I thought we fixed this, but maybe not completely or it came back?

I'm running zstash on compy (without HPSS). zstash create worked fine, but zstash check deletes the files after checking is done. Here is what I did:

module load anaconda3/2019.03
source /share/apps/anaconda3/2019.03/etc/profile.d/conda.sh
conda activate zstash_env
cd /compyfs/gola749/E3SM_simulations/20191216.alpha20.piControl.ne30_r05_oECv3_ICG.compy
mkdir zstash
zstash create --hpss=none  --maxsize 128 . 2>&1 | tee zstash/zstash_create_20200224.log
zstash check --workers=2 2>&1 | tee zstash/zstash_check_20200226.log

I wonder whether it has to do with the fact that I'm not specifying --hpss=none on the command line the second time around, instead relying on what is stored in the database. We need to fix this urgently and make sure we have tests catch this.

Add checksums for tar files

Consider a use case where a user needs to work on a remote machine with different sets of files (unknown up front) extracted from the same zstash tarballs. Today the user has to extract all files locally and transfer them to the remote machine. It would be more efficient to transfer the tarballs once and extract the needed files on the remote machine. Unfortunately, the database does not store checksums of the zstash tarballs, so there is no way to verify that the tarballs were transferred from HPSS to a local machine, and then from the local machine to the remote machine, without errors.
It would be helpful to add checksums of the zstash tarballs to the database.
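A sketch of how this might look, assuming a new (hypothetical) 'tars' table alongside the existing 'files' table:

    import hashlib
    import sqlite3

    def tar_md5(tar_path, blocksize=1024 * 1024):
        # Hash the completed tarball in fixed-size chunks to bound memory use.
        md5 = hashlib.md5()
        with open(tar_path, "rb") as f:
            for block in iter(lambda: f.read(blocksize), b""):
                md5.update(block)
        return md5.hexdigest()

    # Hypothetical table; a real schema change would need migration handling.
    con = sqlite3.connect("zstash/index.db")
    con.execute("CREATE TABLE IF NOT EXISTS tars (name TEXT PRIMARY KEY, md5 TEXT)")
    con.execute(
        "INSERT OR REPLACE INTO tars VALUES (?, ?)",
        ("000000.tar", tar_md5("zstash/000000.tar")),
    )
    con.commit()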

Add message when check completes successfully

If a failure is encountered during zstash check, zstash will list files with problems at the end. If zstash check completes successfully, nothing is printed. Add a message stating that zstash check completed successfully.
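The change could be as small as an else branch next to where failures are reported ('failures' and 'logger' are assumed names):

    # Sketch; variable names are assumptions about zstash's check code.
    if failures:
        logger.error("The following files failed to extract: ...")
    else:
        logger.info("zstash check: all files passed verification")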

Exclude a directory when it ends with "/"

I tested --exclude="dir/", and the files below dir/ were still archived. In typical Linux usage, a trailing "/" is expected to match all the content of that directory.

A possible solution is to add "*" to the exclude strings which end with "/".
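A sketch of the suggested normalization (function name hypothetical):

    def normalize_excludes(patterns):
        # A trailing '/' should match everything under that directory.
        return [p + "*" if p.endswith("/") else p for p in patterns]

    # normalize_excludes(["dir/"]) -> ["dir/*"]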

A related suggestion: use --exclude="archive/rest/???[!05]-*/" to only archive restart files every 5 years, which will save a lot of time and space.

Revisit what's stored in 'config' table for greater portability of archives

In the index.db database, there is a table called 'config' that stores some configuration parameters. For example:

sqlite3 index.db
sqlite> select * from config;
keep|0
maxsize|274877906944
path|/global/cscratch1/sd/golaz/ACME_simulations/20180129.DECKv1b_piControl.ne30_oEC.edison
hpss|2018/E3SM_simulations/20180129.DECKv1b_piControl.ne30_oEC.edison

Storing 'path' and 'hpss' makes archives not very portable. Can we get rid of these fields for new archives and never read them for existing archives?

The advantage of storing 'hpss' is that it allows for shortened usage once the index.db is on disk. For example:

zstash ls post/atm/fv129x256/clim/100yr/*_0001??_0100??_climo.nc

instead of

zstash ls --hpss='/home/g/golaz/2018/E3SM_simulation//20180129.DECKv1b_piControl.ne30_oEC.edison' post/atm/fv129x256/clim/100yr/*_0001??_0100??_climo.nc

The long version is a pain for repeated uses of zstash. To keep the shortcut, maybe we could implement it by storing the hpss path in an ascii file under the local zstash/ directory, but not on HPSS.

The logic would be as follows:

  • If a user invokes zstash without the --hpss option, look for the ascii file 'zstash/hpss.path'. If it's there, read the hpss path from the file. If not, throw an error.
  • If a user invokes zstash with the --hpss option, create the ascii file 'zstash/hpss.path' if it's not already there.

Possible implementation step:

  1. Modify zstash to no longer save 'hpss' and 'path' to the config table or read them from it. Make sure nothing breaks.
  2. Implement an alternate caching mechanism with the local ascii file 'zstash/hpss.path' (sketched below).
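A sketch of the proposed lookup logic (file name from this issue; function name hypothetical):

    import os

    HPSS_PATH_FILE = os.path.join("zstash", "hpss.path")

    def resolve_hpss(cli_hpss):
        # Prefer the --hpss option; otherwise fall back to the cached path.
        if cli_hpss is not None:
            if not os.path.exists(HPSS_PATH_FILE):
                with open(HPSS_PATH_FILE, "w") as f:
                    f.write(cli_hpss + "\n")
            return cli_hpss
        if os.path.exists(HPSS_PATH_FILE):
            with open(HPSS_PATH_FILE) as f:
                return f.read().strip()
        raise RuntimeError("no --hpss given and no zstash/hpss.path file found")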

Questions for @zshaheen:

  1. Does this make sense to you?
  2. Do you think it would break anything?
  3. Should we also remove the 'keep' option from config?

Make 'zstash extract' smarter

Make 'zstash extract' smarter and faster. If the target file is already on disk, with the right size and modification time stamp, skip extraction. Maybe add a '--force' or '--all' option to override this new default behavior.
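A sketch of the proposed check, assuming the database row provides the size and a POSIX mtime (function name hypothetical):

    import os

    def should_skip_extraction(fname, size, mtime, force=False):
        # Skip when the on-disk file already matches the archived size and mtime.
        if force or not os.path.exists(fname):
            return False
        st = os.stat(fname)
        return st.st_size == size and int(st.st_mtime) == int(mtime)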

'update' archives files that haven't changed

On occasion, running zstash update will archive files that haven't really changed. Sometimes the timestamp of the files differs by only one second:

sqlite3 zstash/index.db "select * from files where name is 'archive/ocn/hist/mpaso.hist.0203-11-01_00000.nc';"
45719|archive/ocn/hist/mpaso.hist.0203-11-01_00000.nc|4105024784|2017-12-22 14:45:08|3dc6348021936d6f465cd57837cef2b8|00003e.tar|65680408576
191882|archive/ocn/hist/mpaso.hist.0203-11-01_00000.nc|4105024784|2017-12-22 14:45:07|3dc6348021936d6f465cd57837cef2b8|0000a1.tar|254511583232

It is not obvious why the reported timestamps differ by one second (a NERSC Edison change? a rounding issue?). But maybe we could relax the timestamp test to add a tolerance of one second?

Current relevant code:

if (size_new == size) and (mdtime_new == mdtime):
    # File exists with same size and modification time
    new = False
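A sketch of the relaxed test, assuming mdtime_new and mdtime are datetime objects:

    from datetime import timedelta

    TOLERANCE = timedelta(seconds=1)

    if (size_new == size) and (abs(mdtime_new - mdtime) <= TOLERANCE):
        # File exists with same size and (within tolerance) same modification time
        new = False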

zstash update --hpss=none not working as expected

I created a local zstash of a run on Anvil:

zstash create --hpss=none .

with the idea of moving it to HPSS on NERSC via globus. But I wasn't using screen and I got disconnected part way through the stashing.

I'm now running:

zstash update --hpss=none

and getting:

INFO: Gathering list of files to archive
INFO: Creating new tar archive 000000.tar
Traceback (most recent call last):
  File "/home/xylar/miniconda3/envs/zstash/bin/zstash", line 8, in <module>
    sys.exit(main())
  File "/home/xylar/miniconda3/envs/zstash/lib/python3.7/site-packages/zstash/main.py", line 45, in main
    update()
  File "/home/xylar/miniconda3/envs/zstash/lib/python3.7/site-packages/zstash/update.py", line 156, in update
    failures = add_files(cur, con, itar, newfiles)
  File "/home/xylar/miniconda3/envs/zstash/lib/python3.7/site-packages/zstash/hpss_utils.py", line 29, in add_files
    tar = tarfile.open(os.path.join(CACHE, tfname), "w")
  File "/home/xylar/miniconda3/envs/zstash/lib/python3.7/tarfile.py", line 1611, in open
    return cls.taropen(name, mode, fileobj, **kwargs)
  File "/home/xylar/miniconda3/envs/zstash/lib/python3.7/tarfile.py", line 1621, in taropen
    return cls(name, mode, fileobj, **kwargs)
  File "/home/xylar/miniconda3/envs/zstash/lib/python3.7/tarfile.py", line 1436, in __init__
    fileobj = bltn_open(name, self._mode)
PermissionError: [Errno 13] Permission denied: 'zstash/000000.tar'

It looks like I will need to start over but I would like to have this work in the future.

zstash ls --help incomplete

Very minor (add to issue #65) but the usage for "zstash ls --help" says

  • usage: zstash ls []

but should say

  • usage: zstash ls [] [files]

in line with "zstash extract --help".

(base) -bash-4.1$ zstash version
v0.4.1
(base) -bash-4.1$ zstash ls --help
usage: zstash ls []

List the files from an existing archive

positional arguments: < - - - but in what position?
files

optional arguments:
-h, --help show this help message and exit

optional named arguments:
--hpss HPSS path to HPSS storage
-l show more information for the files
-v, --verbose increase output verbosity

chown error

Reported by Tony Bartoletti:

We now have

ls -l /p/user_pub/e3sm/archive/1_0/20180316.DECKv1b_A1.ne30_oEC.edison/

total 0

drwxr-xr-x 2 bartoletti1 climate 4096 Feb 26 10:29 zstash

ls -l /p/user_pub/e3sm/archive/1_0/20180316.DECKv1b_A1.ne30_oEC.edison/zstash

-rw-r--r-- 1 bartoletti1 climate 274314434560 Feb 25 10:23 000000.tar
-rw-r--r-- 1 bartoletti1 climate 274314434560 Feb 25 09:48 000001.tar
-rw-r--r-- 1 bartoletti1 climate 274314434560 Feb 25 09:48 000002.tar
-rw-r--r-- 1 bartoletti1 climate 274314434560 Feb 25 10:23 000003.tar
-rw-r--r-- 1 bartoletti1 climate 274314434560 Feb 25 10:05 000004.tar
-rw-r--r-- 1 bartoletti1 climate 274314434560 Feb 25 10:04 000005.tar
-rw-r--r-- 1 bartoletti1 climate 274792560640 Feb 25 10:44 000006.tar
-rw-r--r-- 1 bartoletti1 climate 274853683200 Feb 25 10:44 000007.tar
-rw-r--r-- 1 bartoletti1 climate 274800998400 Feb 25 11:04 000008.tar
-rw-r--r-- 1 bartoletti1 climate 274793349120 Feb 25 11:04 000009.tar
-rw-r--r-- 1 bartoletti1 climate 274737991680 Feb 25 11:20 00000a.tar
-rw-r--r-- 1 bartoletti1 climate 274854328320 Feb 25 11:20 00000b.tar
-rw-r--r-- 1 bartoletti1 climate 269865400320 Feb 25 11:36 00000c.tar
-rw-r--r-- 1 bartoletti1 climate 265029734400 Feb 25 11:36 00000d.tar
-rw-r--r-- 1 bartoletti1 climate 1587200 Feb 25 08:58 00000e.tar
-rw-r--r-- 1 bartoletti1 climate 2920448 Feb 25 08:58 index.db

and now (in the original directory 20180316.DECKv1b_A1.ne30_oEC.edison), the command

           zstash  ls  *cam.h4* | more

produces

archive/atm/hist/20180316.DECKv1b_A1.ne30_oEC.edison.cam.h4.1870-01-01-00000.nc
archive/atm/hist/20180316.DECKv1b_A1.ne30_oEC.edison.cam.h4.1870-01-31-00000.nc
archive/atm/hist/20180316.DECKv1b_A1.ne30_oEC.edison.cam.h4.1870-03-02-00000.nc
archive/atm/hist/20180316.DECKv1b_A1.ne30_oEC.edison.cam.h4.1870-04-01-00000.nc
archive/atm/hist/20180316.DECKv1b_A1.ne30_oEC.edison.cam.h4.1870-05-01-00000.nc
archive/atm/hist/20180316.DECKv1b_A1.ne30_oEC.edison.cam.h4.1870-05-31-00000.nc
archive/atm/hist/20180316.DECKv1b_A1.ne30_oEC.edison.cam.h4.1870-06-30-00000.nc

I know of no other way we can use zstash to extract the desired files. AFTER the actual extraction

           zstash extract *cam.h4* | more

ls -l /p/user_pub/e3sm/archive/1_0/20180316.DECKv1b_A1.ne30_oEC.edison/

total 0

drwxr-xr-x 3 bartoletti1 climate 4096 Feb 26 10:57 archive
drwxr-xr-x 2 bartoletti1 climate 4096 Feb 26 10:29 zstash

where archive holds archive/atm/hist/(the desired files)

It all appears to work - but one annoying error cropped up.

  1. Each of the files extracted pushed the following “ERROR” to stderr:

INFO: Extracting archive/atm/hist/20180316.DECKv1b_A1.ne30_oEC.edison.cam.h4.1934-10-08-00000.nc

Traceback (most recent call last):

File "build/bdist.linux-x86_64/egg/zstash/extract.py", line 345, in extractFiles
tar.chown(tarinfo, fname, numeric_owner=False)
TypeError: chown() got an unexpected keyword argument 'numeric_owner'
ERROR: Retrieving archive/atm/hist/20180316.DECKv1b_A1.ne30_oEC.edison.cam.h4.1934-10-08-00000.nc

This “ERROR” (thrown a few thousand times) needs to be shared with the “zstash team” - perhaps limited to the recent release. It does not appear to reflect any issue with the file integrity.

  2. The following tar files had errors:

ERROR: The following tar archives had errors:
ERROR: 000009.tar
ERROR: 00000a.tar
ERROR: 00000c.tar
ERROR: 00000d.tar

As far as I can tell, these are ALL due to the “numeric owner” issue, and actually reiterated ALL 1766 files that were extracted to archive/atm/hist.

When running with multiprocessing and piping the output, the output is out of order.

When running the tests with multiprocessing and piping the output of the processes to Python's subprocess.PIPE, the output was not in order. When I didn't pipe the output, everything printed fine. I assumed it was something about how piping works with Python. Most users wouldn't run zstash with a Python subprocess, so I ignored it.

However, even when using the shell to pipe the output, the same issue is present. Thanks @golaz for discovering this. There is a test case here that I'll try to get working with the piping.

Set tests to run on CSCRATCH on Cori

The following refers to running the Zstash tests on Cori:

We found that running the tests from $HOME may cause a "Resource temporarily unavailable" error. Running on $CSCRATCH fixes this issue; however, this directory is purged every 12 weeks. As such, a clone of the Zstash repo located there would get removed if not used within 12 weeks (or any unused files in the repo would be removed).

Would keeping the tests in a repo on $HOME but running them from $CSCRATCH work?

  • If so, then we could change the script to cd to $CSCRATCH and run the tests from there.
  • If not, then we could change the script to copy the entire repo to $CSCRATCH, cd to $CSCRATCH, and then run the tests from there.

New 'add' functionality

It might be useful to implement a new 'add' functionality to add specific files into an existing archive.

Syntax would be something like:

zstash add --hpss=<hpssDir> file1 file2 ...

'ls' works without the /archive at the end of the path, but then the 'extract' will fail

When listing a set of files with an hpss path such as "/home/m/maltrud/E3SM/20181217.BCRC_CNPCTC20TR_OIBGC.ne30_oECv3.edison", zstash will happily list out the contents of the archive, but if you try to run the same command with "extract" it will fail (it works fine if you use the path /home/m/maltrud/E3SM/20181217.BCRC_CNPCTC20TR_OIBGC.ne30_oECv3.edison/archive). Zstash should probably either:

a) not list the files without the /archive, or
b) extract the files without the /archive.

Extracting Simulations on NERSC

Hi,
I'm trying to extract high resolution simulations stored on NERSC:
/global/cscratch1/sd/acmedata/E3SM_simulations/theta.20180906.branch_noCNT.A_WCYCL1950S_CMIP6_HR.ne120_oRRS18v3_ICG/

into my scratch space:
/global/cscratch1/sd/eroesler/e3sm_viz

Following the github.io documentation, I've loaded the latest e3sm unified environment on NERSC.
I try to query the zstash log for where the files are via zstash ls $path_to_data, but I get errors. I've also tried pointing with the --hpss flag.

Add no-hpss information to zstash -h

zstash create -h gives information about options for this command. It should be specified here (and perhaps in other help pages) that --hpss=none should be used if you do not have HPSS available.

v0.3 release

Test, document, release. Ask to incorporate into E3SM unified environment.

Improve stopping of Zstash

Ctrl-c does not always work to kill the Zstash process. Sometimes, users have to find the process id and use the kill command. It would be ideal if ctrl-c would behave as expected.

Delete tar option

I feel it is useful to have a delete-tar-file option when extracting or checking with zstash. This function is particularly useful for simulations with large data sizes. I recently ran an ensemble of simulations; each member has over 30 tar files, and checking them almost used up all my scratch space.

Support for replicas

I think that it's possible for zstash to support another HPSS repository/directory to store data in. This would greatly help in diagnosing/averting the data corruption problem that I'm guessing is still exhibited.

Current Schema

config Table

    [PK] arg (text)
    value (text)

files Table

    [PK] id (int)
    name (text)
    size (int)
    mtime (timestamp)
    md5 (text)
    tar (text)
    offset (int)

Modifications

I think another config entry should be added to the config table, like replica_dir, which is just another HPSS path. If I recall correctly, NERSC has another HPSS directory that ensures the data is stored on physically separate tape. You'll need to verify this; I'm not sure about other sites.

Then you can expand the files table to include replica_md5 and replica_tar columns or something. These hold the MD5 hash of the file and the tar it lives in under replica_dir.

files Table

    [PK] id (int)
    name (text)
    size (int)
    mtime (timestamp)
    md5 (text)
    tar (text)
    offset (int)
    replica_md5 (text)
    replica_tar (text)
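A sketch of the corresponding schema change, applied with Python's sqlite3 (assuming the existing config and files tables shown above):

    import sqlite3

    con = sqlite3.connect("index.db")
    con.executescript(
        """
        ALTER TABLE files ADD COLUMN replica_md5 TEXT;
        ALTER TABLE files ADD COLUMN replica_tar TEXT;
        """
    )
    # Placeholder value; the real path would come from --replica_hpss.
    con.execute("INSERT INTO config VALUES (?, ?)", ("replica_hpss", "<replica HPSS dir>"))
    con.commit()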

Examples

Writing

When writing, we began with 000000.tar and enumerated ascending. To create the replica_tar, we can start from FFFFFF.tar and descend. This differentiation is needed because all of the tars are stored in a single folder.

So some sample rows would be:

1|something.nc|<size>|<mtime>|<md5>|000000.tar|<offset>|<replica_md5>|FFFFFF.tar
2|something_else.nc|<size>|<mtime>|<md5>|000000.tar|<offset>|<replica_md5>|FFFFFF.tar
3|something_else_too.nc|<size>|<mtime>|<md5>|000000.tar|<offset>|<replica_md5>|FFFFFF.tar
4|something2.nc|<size>|<mtime>|<md5>|000001.tar|<offset>|<replica_md5>|FFFFFE.tar
5|something2_else.nc|<size>|<mtime>|<md5>|000001.tar|<offset>|<replica_md5>|FFFFFE.tar

Reading

If during check/extract there's an MD5 issue, you can try to get the file from the replica_tar and see if that's valid. Again, the replica tar is stored at another HPSS dir. The replica path is in the config table. This is the main use case of this entire feature.

Command Examples

Create

We need to have another argument --replica_hpss.

zstash create --hpss=<path to HPSS> --replica_hpss=<path to a replica HPSS dir> <local path>

Check/Extract

In the command below, the hpss argument could support both the original HPSS dir (--hpss) as well as the replica one (--replica_hpss) since, in theory, they should be interchangeable.
And since you don't want users (who download the data) to mess around with the replicas, this feature should be abstracted away from them and just "work".

$ zstash check --hpss=<path to HPSS> [--workers=<num of processes>] [files]

Update

Like check/extract above, both of the HPSS paths can be used for --hpss.

List

List should actually just list what's in the --hpss. So replica HPSS dirs should show what they have, and same with the "regular" HPSS dirs.


Again, this is just an idea. So if it's not useful, please feel free to close this issue.

sqlite3.OperationalError: disk I/O error on NERSC cfs file system

There was an error when running zstash on the NERSC cfs file system. The cause is likely the file system itself. Would it be possible to let users define the location of the zstash directory instead of the default one under the local path?

DEBUG: Running zstash create
DEBUG: Local path : /global/cfs/cdirs/cmip6/temp_trans/20191123.CO21PCTRAD_RUBISCO_CNPCTC20TR_OIBGC.I1900.ne30_oECv3.compy
DEBUG: HPSS path  : E3SM_simulations/20191123.CO21PCTRAD_RUBISCO_CNPCTC20TR_OIBGC.I1900.ne30_oECv3.compy
DEBUG: Max size  : 274877906944
DEBUG: Keep local tar files  : False
DEBUG: Making sure input path exists and is a directory
DEBUG: Creating target HPSS directory
DEBUG: Making sure target HPSS directory exists and is empty
DEBUG: Creating local cache directory
DEBUG: Creating index database
Traceback (most recent call last):
  File "/global/cfs/cdirs/acme/software/anaconda_envs/base/envs/e3sm_unified_1.3.0/bin/zstash", line 11, in <module>
    load_entry_point('zstash==0.3.0', 'console_scripts', 'zstash')()
  File "/global/cfs/cdirs/acme/software/anaconda_envs/base/envs/e3sm_unified_1.3.0/lib/python3.7/site-packages/zstash-0.3.0-py3.7.egg/zstash/main.py", line 46, in main
  File "/global/cfs/cdirs/acme/software/anaconda_envs/base/envs/e3sm_unified_1.3.0/lib/python3.7/site-packages/zstash-0.3.0-py3.7.egg/zstash/create.py", line 115, in create
sqlite3.OperationalError: disk I/O error

Ensure all optional arguments are tested on and off

Some zstash optional arguments are either always or never tested. It might be good to add some tests where the always-on options are off and vice-versa.

After #61 merges, these options will be as follows:
Always on:

  • chgrp: -R
  • extract: hpss
  • ls: hpss
  • update: hpss

Always off:

  • check: --keep

zstash check creates empty directory structure

Minor bug that should be easily fixed. When running zstash check, zstash still creates the output directory structure. I think it's a matter of relocating the if statement in extract.py near line 139:

    path, name = os.path.split(fname)
    if path != '':
        if not os.path.isdir(path):
            os.makedirs(path)
    if keep_files:
        fout = open(fname, 'w')

os.makedirs(path) should only be invoked if keep_files is True.
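A sketch of the relocated logic, using the same variables as the snippet above:

    path, name = os.path.split(fname)
    if keep_files:
        # Only create directories when files will actually be written.
        if path != '' and not os.path.isdir(path):
            os.makedirs(path)
        fout = open(fname, 'w')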

Performance improvement (background, parallel hsi get)

To improve the performance of zstash extract, execute multiple hsi get commands in parallel (with a user-controlled flag to set the max number) and in the background, rather than proceeding purely sequentially as zstash does now (see the sketch after this list):

  • Retrieve the first relevant tar file from HPSS
  • Extract relevant files
  • Repeat with the next tar file.
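A sketch of one way to overlap retrieval with extraction, using a small thread pool (hpss_get and extract_from_tar are assumed helpers, not the current API):

    import concurrent.futures

    def prefetch_and_extract(tar_names, max_workers=2):
        # Fetch tars from HPSS in the background; extract each as it arrives.
        with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
            futures = {pool.submit(hpss_get, name): name for name in tar_names}
            for fut in concurrent.futures.as_completed(futures):
                fut.result()  # propagate any hsi errors
                extract_from_tar(futures[fut])  # assumed extraction step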

zstash "ls" counterintuitively overloaded

I issue zstash ls archive/atm/hist/*cam.h0.1870*, and these are found in the tar files.
I issue zstash ls archive/atm/hist/*cam.h0.1871*, and these are also found in the tar files.
I issue zstash extract archive/atm/hist/*cam.h0.1870*, and these are extracted locally to archive/atm/hist.

I issue zstash ls archive/atm/hist/*cam.h0.1871*, and these are NO LONGER found. Why?
I issue zstash ls archive/atm/hist/*cam.h0*, and only the previous 1870 files appear. Why?

Because zstash magically behaves differently when the local archive exists, and will no longer address the tarfiles. I need to rename the local archive to force zstash to behave as before.

There is no reason to have zstash replicate the system "ls". The system "ls" works just fine.

Otherwise, we should provide a "tar_ls" to distinguish these.

Duplicated file fullpathnames with different checksums

I have seen duplicated full pathnames in the tar files once before, and assumed it was just a harmless listing issue, but this second time I have confirmed, using "zstash ls -l", that the duplicates have different checksums. This renders extraction (potentially) unpredictable. Two archives exhibit this.

The archives in question are:
/p/user_pub/e3sm/archive/1_1_ECA/BGC/BCRC_CNPECACNT_20TR.ne30_oECv3.cori-knl
/p/user_pub/e3sm/archive/1_1_ECA/BGC/BCRD_CNPECACNT_20TR.ne30_oECv3.cori-knl

For the first set, it is the 2014 monthly
(ocn/hist/mpaso.hist.am.highFrequencyOutput.2014-02-01_00.00.00.nc)
for all months except January. The "original" set appears to be in the 000001.tar file, and the duplicates-in-name appear in the 000004.tar and 000005.tar files. The "ls -l" output is given below.

For the second set, it appears to be the first 8 months of 2011, and January 2014 that have this duplication problem.

I am guessing that this feature is an artifact of the "zstash update" command, and (hopefully) zstash would know to extract only the latest of duplicates (assuming that was the author's intention). Ideally, an "update" would render the previous match un-listable as well as un-extractable. In any case, the "zstash update --help" gives no indication of the expected behavior when applied to either new (append) or existing (replace) material.

In the first case above, with (assumed) originals in 000001.tar, and the later duplicates elsewhere, would anyone who copied over only the 000001.tar and index.db, and attempted extraction of just one duplicated file receive

  1. the original file, because it was the only one available?
  2. an error message that tarfiles are missing?
  3. a reply that there is no such file (referring only to the latest) in the archive?

(When I have the time, I will conduct such an experiment, with a simulated archive where the "zstash" subdirectory contains only symlinks to 000001.tar and the index.db file.)

I will contact the author (Jinyun) for insight, and perhaps revised archives.

Here is the "zstash ls -l" output from the first archive.

    **ocn/hist/mpaso.hist.am.highFrequencyOutput.2014-02-01_00.00.00.nc**       151551336       2019-05-25 10:35:38     **ec7214db895b6f4dd91d333ee15b6c8d**        000001.tar      111445023744
    ocn/hist/mpaso.hist.am.highFrequencyOutput.2014-03-01_00.00.00.nc       181651888       2019-05-25 11:18:12     cd8d46eff4f742eccd28f7f74a841482        000001.tar      111596575744
    ocn/hist/mpaso.hist.am.highFrequencyOutput.2014-04-01_00.00.00.nc       181651888       2019-05-25 11:59:56     937c585eee7abb5c64b59979e41b9e20        000001.tar      111778228224
    ocn/hist/mpaso.hist.am.highFrequencyOutput.2014-05-01_00.00.00.nc       211752440       2019-05-25 12:48:55     046349d6499bd8346fa42c62198bd8a6        000001.tar      111959880704
    ocn/hist/mpaso.hist.am.highFrequencyOutput.2014-06-01_00.00.00.nc       181651888       2019-05-25 13:31:19     2f6b7e8e893b58fc4deb7b74701a2a15        000001.tar      112171633664
    ocn/hist/mpaso.hist.am.highFrequencyOutput.2014-07-01_00.00.00.nc       181651888       2019-05-25 14:14:52     332448a6e3a71ca0ea699cdbf835df4d        000001.tar      112353286144
    ocn/hist/mpaso.hist.am.highFrequencyOutput.2014-08-01_00.00.00.nc       181651888       2019-05-25 14:56:59     1219cef89654ea992df7a4a52cc67fd0        000001.tar      112534938624
    ocn/hist/mpaso.hist.am.highFrequencyOutput.2014-09-01_00.00.00.nc       181651888       2019-05-25 15:39:08     6e9652c1485e3b081681f25c688c79d1        000001.tar      112716591104
    ocn/hist/mpaso.hist.am.highFrequencyOutput.2014-10-01_00.00.00.nc       181651888       2019-05-25 16:20:57     5195c4073a802259f0d4310a3b55f476        000001.tar      112898243584
    ocn/hist/mpaso.hist.am.highFrequencyOutput.2014-11-01_00.00.00.nc       181651888       2019-05-25 17:02:41     4b1d1686fd5c2e3be97162451a13bcb7        000001.tar      113079896064
    ocn/hist/mpaso.hist.am.highFrequencyOutput.2014-12-01_00.00.00.nc       91350232        2019-05-25 17:23:52     9d1700b296b0e8cf0020d54121fc5179        000001.tar      113261548544
    ocn/hist/mpaso.hist.am.highFrequencyOutput.2014-01-01_00.00.00.nc       181651888       2019-05-26 05:29:58     977dd44e3cdb5d4b488ad882a4edb87f        000004.tar      23265457664
    **ocn/hist/mpaso.hist.am.highFrequencyOutput.2014-02-01_00.00.00.nc**       151551336       2019-05-26 06:09:37     **e006c58cf60325e4138f2e2ed374fc76**        000004.tar      23447110144
    ocn/hist/mpaso.hist.am.highFrequencyOutput.2014-03-01_00.00.00.nc       181651888       2019-05-26 06:59:09     f5812aa6ae3790eec34accf16c33e50f        000004.tar      23598662144
    ocn/hist/mpaso.hist.am.highFrequencyOutput.2014-04-01_00.00.00.nc       181651888       2019-05-26 07:49:17     54d2e47b33c97a18005bb8aafea4b475        000004.tar      23780314624
    ocn/hist/mpaso.hist.am.highFrequencyOutput.2014-05-01_00.00.00.nc       211752440       2019-05-26 08:44:12     d67887cec712e91254ae3098ad20e6d2        000004.tar      23961967104
    ocn/hist/mpaso.hist.am.highFrequencyOutput.2014-06-01_00.00.00.nc       181651888       2019-05-26 09:32:19     f0567a0545176ef2aaaebd45f31108f5        000004.tar      24173720064
    ocn/hist/mpaso.hist.am.highFrequencyOutput.2014-07-01_00.00.00.nc       181651888       2019-05-26 17:03:12     2d5396e7968b0baac9bbb1124bccc341        000005.tar      23480672256
    ocn/hist/mpaso.hist.am.highFrequencyOutput.2014-08-01_00.00.00.nc       181651888       2019-05-26 17:49:11     61dc0a2bf2f7fa93de3be74cfd474316        000005.tar      23662324736
    ocn/hist/mpaso.hist.am.highFrequencyOutput.2014-09-01_00.00.00.nc       181651888       2019-05-26 18:36:37     3c60a6a6bb6f5ef416df98cb513244dc        000005.tar      23843977216
    ocn/hist/mpaso.hist.am.highFrequencyOutput.2014-10-01_00.00.00.nc       181651888       2019-05-26 19:23:20     44b7414c49fb24fdbc0d3361800ebfce        000005.tar      24025629696
    ocn/hist/mpaso.hist.am.highFrequencyOutput.2014-11-01_00.00.00.nc       181651888       2019-05-26 20:11:34     d18480e90708b26cab70ca3c1300fcc2        000005.tar      24207282176
    ocn/hist/mpaso.hist.am.highFrequencyOutput.2014-12-01_00.00.00.nc       181651888       2019-05-26 20:59:13     133d29424454569fc4f8d3eef11cbc4b        000005.tar      24388934656
    ocn/hist/mpaso.hist.am.highFrequencyOutput.2015-01-01_00.00.00.nc       31149128        2019-05-26 21:07:15     b8edbb9c0adaa724838253973b43b7ad        000005.tar      24570587136

    Conclusion:  For the archive:  /p/user_pub/e3sm/archive/1_1_ECA/BGC/BCRC_CNPECACNT_20TR.ne30_oECv3.cori-knl
                 Identically-named files (with SAME tar-path) appear in different tar-files, but with different checksums.

Improve robustness

From @tangq: what happens if the archiving job is killed while running?

  1. If zstash is in the process of adding a new file to the tar archive, we would have a tar file with an incomplete file at the end. Not a disaster: the tar file should still be readable, and the database would not know about the incomplete file. Restarting with 'update' would add the missing file to the next tar archive. However, the interrupted tar file would not get transferred to HPSS and would be at risk of being lost.
  2. If zstash was in the process of transferring a tar file to HPSS: not sure. Would HPSS know that it only received a partial file?

Possible safer solution: update database only when the tar file is complete and has been transferred to tape.
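A sketch of that ordering (all names hypothetical):

    # 1. Finish writing the tar on disk.
    write_tar(tar_path, files)
    # 2. Transfer it to tape; an exception here leaves the database untouched.
    hpss_put(hpss, tar_path)
    # 3. Only now record the files, so index.db never references a partial tar.
    insert_file_rows(cur, files)
    con.commit()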

md5sum vs hashlib

I recently had to hash a large number of mpaso files, and found that the "md5sum" cli tool was about 3x faster than the hashlib method used by zstash. Not a huge difference, since most of the time zstash is waiting on hsi, but it would be fairly simple to replace the current hashlib method with calls to the md5sum utility.
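A sketch of that replacement (note that md5sum is a Linux coreutils tool; macOS ships md5 instead):

    import subprocess

    def md5sum_cli(path):
        # Delegate hashing to the md5sum binary; output is "<hash>  <path>".
        out = subprocess.check_output(["md5sum", path])
        return out.split()[0].decode()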

Update Zstash to run on macOS

The command display_mode = "stat --format '%a' {}".format(file_path).split() added in hpss.py in #42 only works on Linux. There needs to be a different command for macOS.
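One portable alternative, sketched below, avoids shelling out entirely and reads the mode bits with os.stat, which behaves the same on Linux and macOS (this is a suggestion, not the project's chosen fix):

    import os
    import stat

    def display_mode(file_path):
        # Return the octal permission string, e.g. "644", without calling stat(1).
        return format(stat.S_IMODE(os.stat(file_path).st_mode), "o")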

HPSS path is not flexible when extracting archived files

The extract command ignores the HPSS path passed to it and still looks for archived files at the HPSS path used in the create command.

I noticed this because I renamed my HPSS directory from ACME to E3SM. When extracting, it still tries to find the archive under the ACME directory.

Can we use the HPSS path supplied to the extract command instead of the original one?

Option to specify alternate location for local zstash files

Currently, zstash will store local copies of database and tar files under the sub-directory zstash/.

In certain cases (see for example #41), it might be useful to have the flexibility to specify an alternate location, maybe with a new command line argument:

--cache=...

Is this worth implementing? Or would it just add unnecessary confusion?
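A sketch of what the argument might look like (flag name from this issue; the default matches current behavior):

    import argparse

    parser = argparse.ArgumentParser(prog="zstash create")
    parser.add_argument(
        "--cache",
        default="zstash",
        help="alternate location for the local database and tar files",
    )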

Add support for checking of local, extracted files

If you run zstash check or even zstash extract, it'll work with tars that were previously downloaded. We want to be able to check the extracted files from these tars as well. Maybe do this via a --check_extracted_files parameter or something similar.

This will be especially useful when fixing the broken files for the v1 DECK output. We'll recreate the files by running the model again, and check these new files against what we had previously. If all is good, we can proceed with uploading the fixed data.
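A sketch of the verification pass, using the name and md5 columns already in the files table (flag handling omitted):

    import hashlib
    import os
    import sqlite3

    def check_extracted_files(db_path="zstash/index.db"):
        # Compare extracted files on disk against the md5s recorded in index.db.
        con = sqlite3.connect(db_path)
        bad = []
        for name, md5 in con.execute("SELECT name, md5 FROM files"):
            if not os.path.exists(name):
                continue  # not extracted; nothing to check
            h = hashlib.md5()
            with open(name, "rb") as f:
                for block in iter(lambda: f.read(1024 * 1024), b""):
                    h.update(block)
            if h.hexdigest() != md5:
                bad.append(name)
        return bad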

Documentation versions

Find a way to have multiple versions of documentation online. Suppose version n is the latest release but master has more commits and the docs have been updated accordingly. Currently, users would see the latest docs online even though the docs describe features not included in the version they are using.

A possible solution would be to have the index.html point to a bulleted list of docs (v1, v2, master/merged-but-not-released-yet).

We can also look into updating the documentation in master instead of on gh-pages. This would be nice because developers could update code and documentation in the same PR. However, it wouldn't solve the primary problem of needing to serve multiple versions of the docs online (unless we had users refer to the docs included in the release they download rather than looking online).

zstash update error recovery

My latest zstash update failed due to an I/O error:

INFO: Transferring file to HPSS: zstash/000032.tar
Traceback (most recent call last):
  File "/global/project/projectdirs/acme/software/anaconda_envs/edison/base/envs/e3sm_unified_1.2.4_py2.7_nox/bin/zstash", line 11, in <module>
    load_entry_point('zstash==0.1.0', 'console_scripts', 'zstash')()
  File "build/bdist.linux-x86_64/egg/zstash/main.py", line 43, in main
  File "build/bdist.linux-x86_64/egg/zstash/update.py", line 146, in update
  File "build/bdist.linux-x86_64/egg/zstash/utils.py", line 85, in addfiles
sqlite3.OperationalError: disk I/O error

It may have been a quota issue (I'm not sure), but my question is: how do I recover from this? Just restart zstash update?
