
ENDIT - Efficient Northern Dcache Interface to TSM

ENDIT daemons

Concept

Use the same file system as the dCache pool for the HSM staging area, using hardlinks to "store" data, and then use batch processes to archive and restore data to/from tape.

ENDIT is comprised of an ENDIT dCache plugin and the ENDIT daemons.

The IBM Storage Protect (Spectrum Protect, TSM) client is used to perform the actual transfer of files to/from the tape system.

Requirements

The ENDIT daemons are known to work on Perl 5.10 onwards.

At least the following Perl modules need to be installed:

  • JSON
    • libjson-perl (deb), perl-JSON (rpm)
  • JSON::XS (approx 100 times faster parsing compared to pure-perl JSON)
    • libjson-xs-perl (deb), perl-JSON-XS (rpm)
  • Schedule::Cron (optional, allows for crontab style specification of the deletion queue processing interval)
    • libschedule-cron-perl (deb), perl-Schedule-Cron (rpm)
  • Filesys::Df
    • libfilesys-df-perl (deb), perl-Filesys-Df (rpm)

A recent version of the IBM Storage Protect (TSM) client is recommended, as of this writing v8.1.11 or later, due to commit 796a02a and IBM APAR IT33143.

Installation and Configuration

All dCache tape pools need both the ENDIT dCache plugin and the ENDIT daemons installed.

If needed, more verbose instructions are available in the NeIC wiki at https://wiki.neic.no/wiki/DCache_TSM_interface

TSM (IBM Storage Protect)

Set up TSM so that the user running dCache can dsmc archive and dsmc retrieve files. If you want to have several pool nodes talking to tape, we recommend setting up a TSM proxy node that you can share across machines using dsmc -asnode=NODENAME. Due to recent changes in TSM client authentication we strongly recommend not using a machine-global TSM node, but instead creating a dedicated TSM node for each dCache runtime user. See the IBM documentation on non-root usage for the recommended setup.

A dCache hsminstance typically maps into a dedicated TSM proxy node. With a proxy node you can have multiple read and write pool nodes accessing the same data in TSM. Different TSM nodes need to have different hsminstances.
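
For illustration, registering pool nodes and a shared proxy node on the TSM server might look like the following sketch (all node, domain, and password names are hypothetical):

REGISTER NODE TAPE_PROXY proxypw DOMAIN=ENDIT_DOMAIN
REGISTER NODE POOL1 pool1pw DOMAIN=ENDIT_DOMAIN
GRANT PROXYNODE TARGET=TAPE_PROXY AGENT=POOL1

The pool machines then authenticate as their own node (POOL1) while running dsmc -asnode=TAPE_PROXY.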

The common ENDIT-optimized TSM storage hierarchy setup is to have a dedicated policy domain for each proxy node and to define a tape storage pool as the archive copygroup destination. Since tsmarchiver.pl batches archive operations into larger chunks, there is limited benefit in spooling data to disk on the TSM server before moving it to tape.
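
A minimal sketch of such a policy setup, assuming a pre-existing tape storage pool named TAPEPOOL (all names here are hypothetical):

DEFINE DOMAIN ENDIT_DOMAIN
DEFINE POLICYSET ENDIT_DOMAIN ENDIT_PS
DEFINE MGMTCLASS ENDIT_DOMAIN ENDIT_PS ENDIT_MC
DEFINE COPYGROUP ENDIT_DOMAIN ENDIT_PS ENDIT_MC TYPE=ARCHIVE DESTINATION=TAPEPOOL RETVER=NOLIMIT
ASSIGN DEFMGMTCLASS ENDIT_DOMAIN ENDIT_PS ENDIT_MC
ACTIVATE POLICYSET ENDIT_DOMAIN ENDIT_PS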

For each TSM node defined on your TSM server, ensure that the following options are correct for your environment (an example command follows the list):

  • MAXNUMMP - Increase to the sum of concurrent/parallel dsmc archive and dsmc retrieve sessions plus some margin to avoid errors when tapes are concurrently being mounted/dismounted.
  • SPLITLARGEObjects - set to No to optimize for tape.
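
For example, on the TSM server (the node name and mount point count are illustrative only):

UPDATE NODE TAPE_PROXY MAXNUMMP=8 SPLITLARGEOBJECTS=No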

On your TSM client machine (i.e. the dCache pool machine), ensure that you have set the appropriate tuning options for optimizing performance in a tape environment; see the IBM documentation on Using high performance tape drives for further details. It is also recommended to define the out directory as a separate file system in TSM using the VirtualMountPoint configuration option.

Typical dsm.sys excerpt:

TXNBYTELIMIT      10G
VIRTUALMountpoint /grid/pool/out

We also recommend disabling ACL support, as ACLs are file system specific (you cannot restore files to a different file system type), so having ACL support enabled makes it hard to change the setup in the future.

Typical dsm.opt excerpt:

SKIPACL YES

If the machine is running scheduled TSM backups, you will want to exclude the pool file system(s) from the backup.

Typical system include-exclude file excerpt:

exclude.dir     /grid/pool*/.../*
exclude.fs      /grid/pool*/.../*

dCache

The ENDIT dCache plugin needs to be installed on the pool.

To get good store performance the dCache pool must be tuned for continuous flushing.

To get any efficiency in retrieves, you need to allow a large number of concurrent restores and have a long timeout for them.

Note that since ENDIT v2 a late allocation scheme is used in order to expose all pending read requests to the pools. This minimizes tape remounts and thus optimizes access. For new installations, and when upgrading from ENDIT v1 to v2, note that:

  • The dCache pool size needs to be set lower than the actual file space size, 1 TiB lower if the default retriever_buffersize is used.
  • You need to allow a really large number of concurrent restores and thus might need an even larger restore timeout. ENDIT has been verified with 1 million requests on a single tape pool with modest hardware; central dCache resources on your site might well limit this number.

The configuration of the ENDIT dCache plugin is done through the dCache admin interface.
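
As a hypothetical illustration only (verify the provider name and options against the ENDIT plugin documentation), attaching the plugin to a pool might look like this in the pool admin cell:

hsm create osm osm endit -directory=/grid/pool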

ENDIT daemons

Download the ENDIT daemons to a directory of your choice; /opt/endit is our suggestion. To make future upgrades easier we recommend cloning directly from the GitHub repository.

Execute one of the daemons (for example tsmretriever.pl) once in order to generate a sample configuration file. When no configuration is found the ENDIT daemons will generate a sample file and write it to a random file name shown in the output, and then exit.

Review the sample configuration, tune it to your needs, and copy it to the location where ENDIT expects to find it (or use the ENDIT_CONFIG environment variable, see below). The following items need special attention (a configuration excerpt follows the list):

  • dir - The pool base directory.
  • desc-short - Strongly recommended to set to match the dCache pool.name.
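
A minimal endit.conf excerpt might look like the following sketch, assuming the key: value syntax of the generated sample (the values are site-specific examples):

dir: /grid/pool
desc-short: pool1_tape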

Starting from a generated sample configuration is highly recommended, as it is the main documentation for the ENDIT daemon configuration file and also contains an example of how to enable multiple session support for archiving and retrieving files. The multiple session archive support in tsmarchiver.pl adapts to the backlog, i.e. how much data needs to be stored to TSM, according to your configuration choices. The multiple session retrieve support in tsmretriever.pl requires a tape hint file (see below) that enables running multiple sessions, each accessing a single tape.

On startup, the ENDIT daemons will check/create needed subdirectories in the base directory, as specified by the dir configuration directive in endit.conf.

After starting dCache you also need to start the three scripts:

  • tsmarchiver.pl
  • tsmretriever.pl
  • tsmdeleter.pl

See startup/README.md for details/examples.
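
As an illustration, a minimal systemd service for one of the daemons could look like this sketch (the install path and user are assumptions; prefer the examples shipped in startup/):

[Unit]
Description=ENDIT tsmarchiver daemon
After=network-online.target

[Service]
User=dcache
ExecStart=/opt/endit/tsmarchiver.pl
Restart=on-failure

[Install]
WantedBy=multi-user.target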

By default the ENDIT daemons create files with statistics in the /run/endit directory. tmpfiles.d can be used to create the directory on boot; here is an example /etc/tmpfiles.d/endit.conf snippet:

d /run/endit 0755 dcache dcache

Note that the directory and the statistics files are world-readable by design; they contain no secrets and usually need to be accessed by other processes such as the Prometheus node_exporter.

To enable concurrent retrieves from multiple tapes you must use a tape hint file, a file that provides information on which tape volume each file is stored.

Tape hint file

The tape hint file name to be loaded by tsmretriever.pl is set using the retriever_hintfile specifier in the configuration file.

This file can be generated either from the TSM server side or from the TSM client side using one of the provided scripts. Choose the one that works best in your environment.

In general we recommend running only one script per TSM proxy node name and distributing the resulting file to all affected hosts running ENDIT. Running multiple scripts works, but may put unnecessary strain on your TSM server database.

Updating the file by running the script daily is recommended.
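
For example, a daily /etc/cron.d entry might look like this (the install path and user are assumptions):

30 4 * * * dcache /opt/endit/tsmtapehints.pl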

tsm_getvolumecontent.pl

This method communicates with the TSM server. It has the following requirements:

  • The dsmadmc utility set up to communicate properly with the TSM server.
  • A TSM server administrative user (no extra privilege needed).

Benefits:

  • Volume names are the real tape volume names as used by TSM
  • Tests have shown this method to be approximately a factor of 2 faster than using tsmtapehints.pl

Drawbacks:

  • More cumbersome to set up:
    • Requires dsmadmc
    • Requires close cooperation with TSM server admins due to admin user etc.
    • Requires TSM admin user password in a clear-text file

tsmtapehints.pl

This method runs together with the other ENDIT daemons and uses the dsmc command as specified by the ENDIT configuration file to list file information.

Benefits:

  • Easier to set up:
    • Uses the ENDIT configuration file
    • Only needs periodic invocation (crontab, systemd timer)
  • Performs some sanity checking, in particular detection of duplicates of archived files (multiple tape copies of the same file object)

Drawbacks:

  • Volume names are numeric IDs, good enough to group files correctly but not easily usable by a TSM admin to identify a specific tape volume in case of issues.
  • Slower; tests have shown a factor of 2 slowdown compared to tsm_getvolumecontent.pl

Multiple instances

To run multiple instances for different tape pools on one host, the ENDIT_CONFIG environment variable can be set to use a different configuration file. This is not to be confused with enabling parallel/multiple archive and retrieve operations for one pool which is done using options in the ENDIT daemon configuration file.
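
For example (the configuration file path is hypothetical):

ENDIT_CONFIG=/opt/endit/endit-pool2.conf /opt/endit/tsmarchiver.pl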

Bypassing delays/threshold/timers when testing

The ENDIT daemons are designed to avoid unnecessary tape mounts, and achieve this by employing various thresholds and timers, as explained in the example configuration file.

However, when doing functional tests or error recovery related to the tape system, it can be really frustrating having to wait longer than necessary. For these situations you can use the USR1 signal handling in the ENDIT daemons. In general, the USR1 signal tells the daemons to disregard all timers and thresholds and perform any pending actions immediately.
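
For example, to make tsmretriever.pl act on pending requests immediately:

pkill -USR1 -f tsmretriever.pl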

Temporary configuration overrides

It's possible to (temporarily) override select configuration items using a separate JSON configuration file.

This makes it possible for sites to automate some load balancing tasks, for example implementing backoff mechanisms for sites where a large queue of reads starves writes.

Since this is focused on on-the-fly automatic solutions, the configuration override file is a JSON file, to make it easy to create using whatever tool is suitable for the job. It is assumed that the main endit.conf configuration file is under the control of a configuration management tool such as Puppet, Ansible, etc., and is thus not suitable for on-the-fly manipulation.

The default file location chosen is /run/endit/conf-override.json with the motivation that overrides are temporary.
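
A minimal override file might look like this sketch (assuming retriever_maxworkers, a configuration item mentioned in the issues below, is among the overridable items):

{ "retriever_maxworkers": 1 }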

Statistics

The ENDIT daemons generate statistics in JSON and Prometheus node_exporter formatted files, by default in the /run/endit directory. The current implementation dumps the ENDIT internals unprocessed; sizes are generally in GiB, denoted by _gib in the metric name. The best documentation for now are the ENDIT daemon scripts themselves, UTSL :-)

It is strongly recommended to set desc-short in endit.conf to match the dCache pool.name since this is used to tag metrics with supposedly unique hsm tags in order to be able to differentiate metrics on hosts running multiple pools.

When using node_exporter, the suggested implementation is to simply symlink the ENDIT .prom files into your node_exporter textfile collector directory.
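
For example (the .prom file name and collector directory are hypothetical):

ln -s /run/endit/endit_pool1.prom /var/lib/node_exporter/textfile_collector/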

Migration and/or decommission

When migrating the ENDIT service to another host (typically when renewing hardware), ensure that pending operations have finished before shutting down ENDIT and the dCache pool; a quick check is sketched after the list below.

  • Check the trash/ and trash/queue/ directories; both should be empty.
    • If the trash/ directory has files in it, the dCache pool is still receiving deletion requests. Take action to prevent this. tsmdeleter will queue the deletion requests on the next iteration cycle (default every minute).
    • If the trash/queue/ directory has files in it, there are queued deletion requests. Either wait until the queue is processed (default once per month) or force queue processing by sending a USR1 signal to the tsmdeleter.pl process. Review tsmdeleter.log for progress and double-check the trash/queue/ directory afterwards.
  • Check the out/ directory; it should not contain any files.
    • If the out/ directory has files in it, data is still being flushed from the dCache pool to tape. Take action to prevent this. Either wait until tsmarchiver processes the archive queue (default up to 6 hours) or force archiving by sending a USR1 signal to the tsmarchiver.pl process. Review tsmarchiver.log for progress and double-check the out/ directory afterwards.
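
A quick way to verify that the directories are drained (the pool base directory is hypothetical; no output means no pending files, and trash/queue/ is covered by the trash/ argument):

find /grid/pool/trash /grid/pool/out -type f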

Collaboration

It's all healthy perl, no icky surprises, we hope. Patches, suggestions, etc are most welcome.

License

GPL-3.0, see LICENSE

Versioning

Semantic Versioning 2.0.0

Contributors

This project existed for approximately 10 years before it was added to GitHub, with contributions from maswan, zao, and znikke.

Issues

Add cputime limit to dsmc processes

dsmc has a few corner case bugs where it can get stuck in a loop consuming 100% CPU and never recover.

Work around this by adding a CPU time limit to spawned dsmc processes; 48 CPU-hours or so should be high enough not to trigger in normal use cases.
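
A minimal sketch of such a limit, assuming the BSD::Resource CPAN module and hypothetical dsmc arguments:

use BSD::Resource qw(setrlimit RLIMIT_CPU);

my @dsmc_cmd = ('dsmc', 'archive', '-filelist=/grid/pool/requestlist');  # hypothetical arguments
my $limit    = 48 * 3600;  # 48 CPU-hours, in seconds

my $pid = fork();
die "fork failed: $!" unless defined $pid;
if ($pid == 0) {
    # Child: apply the CPU time limit, then exec dsmc
    setrlimit(RLIMIT_CPU, $limit, $limit) or die "setrlimit: $!";
    exec(@dsmc_cmd) or die "exec dsmc: $!";
}
waitpid($pid, 0);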

Add force-flush/recall via signal handler

There are corner cases when a site admin is stuck waiting for tsmarchiver/tsmretriever to perform actions (usually tests or error recovery), but the operation gets delayed by ENDIT applying the various timeout/delays configured. The only options today are either to wait, or to edit endit.conf lowering the relevant timeout/delays, restart the ENDIT daemons, let the operation complete, and then revert settings and restart again.

Forcing a flush/recall in tsmarchiver/tsmretriever should be relatively simple to implement by adding a signal handler for a suitable signal, for example USR1, that sets a variable for the state machines to act on. This would let the admin simply do kill -USR1 daemonpid to bypass waiting and force immediate action.
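
A minimal sketch of such a handler (the variable name is hypothetical):

my $force_run = 0;
$SIG{USR1} = sub { $force_run = 1 };

# In the state machine's wait loop:
if ($force_run) {
    $force_run = 0;
    # Skip remaining timers/thresholds and process pending work now
}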

tsmretriever: cleanup of in/ on startup

The in/ directory can have stray files from occasions when crashes/restarts have caused things not to clean up normally.

Implement a rudimentary cleanup in tsmretriever that simply stats all files and unlinks those more than a month old according to ctime.
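
A rough sketch of that cleanup (the base directory is hypothetical):

my $indir  = '/grid/pool/in';
my $cutoff = time() - 30 * 24 * 3600;  # roughly one month ago

foreach my $file (glob("$indir/*")) {
    my $ctime = (stat($file))[10];  # inode change time
    next unless defined $ctime;
    unlink($file) if $ctime < $cutoff;
}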

Centralised logging

Tape usage statistics will need some centralized logging of data to generate https://twiki.cern.ch/twiki/bin/view/HEPTape/TapeMetricsJSON

Things to include: hsminstance, total number of files, number of tapes

Each tape mount with: time, number of files, number of bytes

Errors (i.e. all local logging should also be sent remote), for OoD enjoyment

Idea: See if we can use dcache's log4j central logging stuff, so we only have to implement a sender, not a collector.

Investigate removing use of IPC::Run3

At a quick glance, the use of IPC::Run3 is motivated by answering A to a replace-question in tsmretriever.pl.

Investigate whether we should use -replace=All and/or -ifnewer instead.

Reload config automatically/dynamically

Currently, config changes require a restart of the ENDIT daemons. In order to implement backoff mechanisms for sites where a large queue of reads starves writes, it would be beneficial to be able to dynamically reload parts of, or the entire, configuration.

Points to consider:

  • Automatically detect config file change or require a SIGHUP?
  • Allow to do this for the entire configuration, or just selected items?
  • Given that we want to automate the changing of retriever_maxworkers, perhaps only allow defining an override config file (typically in /run/ somewhere) and just dynamically reload that?

Add configurable short/long descriptions

Prepare for centralized logging by adding short/long descriptions to config.

Short: Printed in every log message, typical value same as dCache hsminstance.

Long: Printed on startup (or daily?), might be used for descriptive text or for passing metadata to central logs (e.g. hsminstance=ops_tape_read tapesize=15T tapesallocated=700 tapespeedmbps=400 or similar).

tsmarchiver: Be more aggressive when retrying

Currently we honor archiver_timeout also for retries when the amount of data is below archiver_threshold1_usage.

However, if set to a large value this might cause stores to fail due to retry happening too late so the store times out on a dCache level.

Suggestion: Introduce a retry timeout (say 1 hour by default) and use the lower of archiver_timeout and the retry timeout when retrying stores.

Make tsmdeleter volume-aware

  • Deletions are commonly done in wide campaigns which results in deletions trickling in to us over a longer time period (days/weeks)
  • TSM space reclamation is rather naive and kicks in when a reclamation limit is reached
    • If deletions are still ongoing for a volume this results in newly reclaimed tape volumes with gaps due to files deleted post-reclamation
  • Workaround is to make tsmdeleter volume-aware
    • Require that no deletions have happened on a volume for a set amount of time (one week by default?)
    • Or that we will delete all files on a volume
  • Deletions seem to end up on all involved ENDIT daemon instances now; is this something we can change so all deletions end up on the same instance?
    • If not, we can at least sync all instances so deletes happen at the same time.

Prometheus counters for bytes stored and retrieved

It would be useful to have these two metrics, maybe something like:

# HELP endit_archiver_stored_bytes The number of bytes stored to tape by this ENDIT process.
# TYPE endit_archiver_stored_bytes counter

# HELP endit_retriever_restored_bytes The number of bytes restored from tape by this ENDIT process.
# TYPE endit_retriever_restored_bytes counter

Implement backoff when retrying dsmc operations

Currently all retries of dsmc operations are done after a fixed retry period.

In cases where big operations fail (e.g. due to server malfunction, broken tape, etc.) this can result in logs filling up when using the default retry delay of 60 seconds.

Investigate if we can do exponential backoff on "large" failures; these should be identifiable either from dsmc return codes/messages or simply by realising that we're retrying the same files over and over...
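
A minimal sketch of capped exponential backoff (try_dsmc_operation() is a hypothetical helper that runs dsmc and reports success):

use List::Util qw(min);

my $delay     = 60;    # default retry delay, in seconds
my $max_delay = 3600;  # cap the backoff at one hour

until (try_dsmc_operation()) {
    sleep($delay);
    $delay = min($delay * 2, $max_delay);  # double the delay on each failure
}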

Multiple endits on one host

When running multiple read/write pools on the same machine it would be nice to be able to have just one installed copy of ENDIT with multiple config files. With the current version you need to install ENDIT several times in different locations.

Add possibility to use dsmc query archive -detail output for tapehints

Fairly recently, dsmc query backup/archive -detail gained the ability to list file locations.

Typical output:

             Size  Archive Date - Time    File - Expires on - Description
             ----  -------------------    -------------------------------
        72 191  B  2016-04-01 04:26:47    /grid/pool/out/00000000045ECF57483D99E8FFB9FEC78EF3 Never endit  RetInit:STARTED  ObjHeld:NO
         Modified: 2016-04-01 01:53:38  Accessed: 2016-04-01 01:13:43  Inode changed: 2016-04-01 02:24:42
         Compression Type: None  Encryption Type:        None  Client-deduplicated: NO
  Media Class: Library  Volume ID: 724403  Restore Order: 00000000-00000002-00000000-02ABF65E

We should be able to use this as an alternative method to produce tape hint files.

Revert tsmarchiver to old behaviour of year/month in description of archived files

In the beginning of time, ENDIT archived files with a description of the form 'endit YYYY-MM'. This was however changed to just the description 'endit' due to performance limitations of the old TSM (v5 and earlier) server database.

Since v6, TSM uses DB2, so the old issue is moot, and we should revert to the previous behaviour.

Having a somewhat unique description helps when needing to delete duplicates, and in general helps to quickly assess file age when viewing dsmc q archive output.

Properly document Prometheus stats file dirs

Commits 7a3d50a and 5c5596c add generation of Prometheus style *.prom files.

The README needs to be updated with this, and the tmpfiles.d example from the runtime config override change modified to match. It's likely counter-productive to document the metrics while they remain a moving target, until we narrow down exactly what's useful.

Refactor archiver to spawn multiple single-drive dsmc processes instead of varying drive use of a single process

Currently the archiver spawns a single dsmc process, but can vary the arguments used for that process; this is used to specify different -resourceutilization values, which translates to using a different number of drives concurrently.

This works OK in the trivial usecase, but there are a number of corner cases when this is suboptimal:

  • Inflow of data increases rapidly just after a single-drive archive session has started. Currently we have to wait for the running dsmc to finish before we are able to start using more drives.
  • If we request use of multiple drives, but not enough drives are free, we can end up with dsmc allocating a subset of the needed drives and then just sitting and waiting until enough drives are free before continuing. The idle allocated drives not only hurt us, but also other TSM server operations that could make use of them.

We need to refactor the archiver to:

  • Be able to spawn and track multiple dsmc processes, each with a unique subset of files to be archived.
  • Be able to first spawn one process, and if data inflow increases spawn another one if needed.

Doing this would also enable us to better handle datasets, if support for that eventually shows up in dCache.

Things to remember to cater for:

  • Related config file changes, document how to migrate settings and handle old settings appropriately (warn or error out?)

Document how to find/delete duplicate archived files

Older ENDIT versions seem to have been prone to archiving files multiple times when certain error conditions were triggered.

This should be due to old bugs, but we need to document how to detect whether this has happened and, most importantly, how to clean up afterwards.

Procedure is something like:

  • dsmc q archive -asnode=NODENAME '/path/out/*' and filter out the duplicates
  • Determine whether we should keep the oldest or the newest file (it likely should not matter, but let's settle for the file that matches the operation that dCache deems successful, probably the last one)
  • Delete the file. If the descriptions are identical, this will require using dsmc delete archive -pick '/path/to/file' in order to be able to select just one of the duplicates.

Packaging endit scripts

It would be nice if ENDIT could be packaged, since that would make ENDIT updates easier at local sites.
The RPM installation path may differ between sites, but if an RPM build script (or template) were provided, a local site admin could alter the script (e.g. changing the installation path) to build a customised RPM for the local site.

Do chdir / on startup to avoid cwd being in deleted directory

If the ENDIT daemons are started in a directory that is subsequently deleted, store/retrieve operations will fail with errors similar to:

dsmc retrieve failure volume default file list /tapecache/requestlists/default.OQ_knf: child exited with value 8
STDERR: shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory

The obvious workaround is to do chdir / on startup like proper daemons do.
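
A one-line sketch of that workaround, early in each daemon's startup:

chdir('/') or warn "chdir /: $!";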
