neicnordic / endit
Efficient Northern dCache Interface to TSM
License: GNU General Public License v3.0
There are install instructions in both README and INSTALL, not necessarily synchronized.
Best way forward is likely to scrap INSTALL and collect all info in README.
It would be useful to have these two metrics, maybe something like:
# HELP endit_archiver_stored_bytes The number of bytes stored to tape by this ENDIT process.
# TYPE endit_archiver_stored_bytes counter
# HELP endit_retriever_restored_bytes The number of bytes restored from tape by this ENDIT process.
# TYPE endit_retriever_restored_bytes counter
There are corner cases where a site admin is stuck waiting for tsmarchiver/tsmretriever to perform actions (usually tests or error recovery), but the operation gets delayed by ENDIT applying the various configured timeouts/delays. The only options today are either to wait, or to edit endit.conf to lower the relevant timeouts/delays, restart the ENDIT daemons, let the operation complete, and then revert the settings and restart again.
Forcing a flush/recall in tsmarchiver/tsmretriever should be relatively simple to implement by adding a signal handler for a suitable signal, for example USR1, that sets a variable for the state machines to act on. This would let the admin simply run kill -USR1 daemonpid
to bypass the waiting and force immediate action.
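A minimal sketch of such a handler, assuming a main loop that sleeps between passes (variable and function names are illustrative, not taken from the actual ENDIT source):

# Hedged sketch: SIGUSR1 sets a flag that the wait loop checks.
my $force_run = 0;
$SIG{USR1} = sub { $force_run = 1 };

while (1) {
    # Sleep in one-second slices so the flag is noticed quickly.
    for (1 .. $conf{archiver_timeout}) {
        last if $force_run;
        sleep 1;
    }
    $force_run = 0;
    do_flush();    # hypothetical: kick off the dsmc archive run
}

With something like this in place, kill -USR1 daemonpid takes effect within a second instead of after the full configured delay.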
The in/ directory can have stray files from occasions when crashes/restarts have caused things not to clean up normally.
Implement a rudimentary cleanup in tsmretriever that simply stats all files and unlinks those more than a month old according to ctime.
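A rudimentary version could look like the following sketch (the in/ location and the $conf{dir} key are assumptions; adjust to the real config):

use File::Spec;

# Hedged sketch: unlink files in in/ whose ctime is older than ~30 days.
my $indir  = File::Spec->catdir($conf{dir}, 'in');    # assumed pool base dir key
my $cutoff = time() - 30 * 24 * 3600;

opendir(my $dh, $indir) or die "opendir $indir: $!";
while (my $f = readdir($dh)) {
    next if $f =~ /^\.\.?$/;
    my $path = File::Spec->catfile($indir, $f);
    my @st = stat($path) or next;    # file may vanish under normal operation
    next unless -f _;                # plain files only
    if ($st[10] < $cutoff) {         # $st[10] is ctime
        unlink($path) or warn "unlink $path: $!";
    }
}
closedir($dh);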
Currently the archiver spawns a single dsmc process, but it has the capability of varying the arguments used for that process, which is used to specify a different resourceutilization, which in turn translates to using a different number of drives concurrently.
This works OK in the trivial use case, but there are a number of corner cases where this is suboptimal:
- Having to wait for the running dsmc to finish before we are able to start using more drives.
- dsmc allocating a subset of the needed drives and then just sitting and waiting until enough drives are free before continuing. Not using the idle drive not only hurts us, but also other TSM server operations that could make use of that drive.
We need to refactor the archiver to spawn multiple dsmc processes, each with a unique subset of files to be archived; see the sketch below. Doing this would also enable us to better handle datasets, if support for that eventually shows up in dCache.
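A hedged sketch of the spawning side, partitioning the backlog across several dsmc processes (archiver_maxprocs and write_filelist are hypothetical names, and real code would track the children asynchronously rather than block):

# Split the pending files into N sets and run one dsmc per set.
sub spawn_archivers {
    my (@files) = @_;
    my $nproc = $conf{archiver_maxprocs} // 2;    # hypothetical setting
    my @sets;
    push @{ $sets[$_ % $nproc] }, $files[$_] for 0 .. $#files;

    my @pids;
    for my $set (@sets) {
        next unless $set && @$set;
        my $listfile = write_filelist($set);      # hypothetical helper
        my $pid = fork();
        die "fork: $!" unless defined $pid;
        if ($pid == 0) {
            exec('dsmc', 'archive', "-filelist=$listfile", '-description=endit')
                or die "exec dsmc: $!";
        }
        push @pids, $pid;
    }
    waitpid($_, 0) for @pids;    # simplification: wait for all to finish
}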
Things to remember to cater for:
When server access is disabled the error message is:
ANS1355E Session rejected: Server disabled
The tsm* processes need to be aware of this error.
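For example, the retry logic could special-case that message instead of treating it as a generic failure (a sketch; the config key is hypothetical):

# Back off for longer when the TSM server is administratively disabled.
if ($dsmc_output =~ /ANS1355E/) {
    warn "TSM server disabled, delaying retry\n";
    sleep($conf{server_disabled_delay} // 600);
}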
Fairly recently, dsmc query backup/archive -detail gained the ability to list file locations.
Typical output:
Size Archive Date - Time File - Expires on - Description
---- ------------------- -------------------------------
72 191 B 2016-04-01 04:26:47 /grid/pool/out/00000000045ECF57483D99E8FFB9FEC78EF3 Never endit RetInit:STARTED ObjHeld:NO
Modified: 2016-04-01 01:53:38 Accessed: 2016-04-01 01:13:43 Inode changed: 2016-04-01 02:24:42
Compression Type: None Encryption Type: None Client-deduplicated: NO
Media Class: Library Volume ID: 724403 Restore Order: 00000000-00000002-00000000-02ABF65E
We should be able to use this as an alternative method for producing tape hint files; a parsing sketch follows.
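A hedged sketch of pulling out the interesting fields (the regexes are written against the sample above and would need verifying against real dsmc output):

# Emit "file volume restore-order" lines from dsmc query archive -detail.
open(my $q, '-|', 'dsmc', 'query', 'archive', '-detail', "$dir/*")
    or die "dsmc: $!";
my $file;
while (my $line = <$q>) {
    if ($line =~ m{^\s*[\d\s]+B\s+\S+\s+\S+\s+(\S+)\s}) {
        $file = $1;    # pathname column of the summary line
    }
    elsif ($line =~ /Volume ID:\s*(\d+).*Restore Order:\s*(\S+)/) {
        print "$file $1 $2\n" if defined $file;
    }
}
close($q);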
dsmc has a few corner case bugs where it can get stuck in a loop consuming 100% CPU and never recover.
Work around this by adding a CPU time limit to spawned dsmc processes; 48 CPU-hours or so should be enough to not trigger in normal use cases.
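One way is to set RLIMIT_CPU in the child between fork() and exec(), for example via BSD::Resource (a sketch; @dsmc_args stands in for the real argument handling):

use BSD::Resource qw(setrlimit RLIMIT_CPU);

my $pid = fork();
die "fork: $!" unless defined $pid;
if ($pid == 0) {
    my $limit = 48 * 3600;    # 48 CPU-hours, in seconds
    setrlimit(RLIMIT_CPU, $limit, $limit) or die "setrlimit: $!";
    exec('dsmc', @dsmc_args) or die "exec dsmc: $!";
}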
Currently, config changes require a restart of the ENDIT daemons. In order to implement backoff mechanisms for sites where lots of queued reads result in starving writes, it would be beneficial to be able to dynamically reload parts of, or the entire, configuration.
Points to consider: which settings make sense to reload dynamically (for example retriever_maxworkers)? Perhaps only allow defining an override config file (typically somewhere in /run/) and just dynamically reload that? A minimal sketch follows.
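A minimal sketch of such an override reload, triggered by SIGHUP (the path and the key: value syntax are assumptions):

my $override = '/run/endit/override.conf';    # assumed location
my $reload   = 0;
$SIG{HUP} = sub { $reload = 1 };

# Merge override settings on top of the main config.
sub apply_override {
    return unless -r $override;
    open(my $fh, '<', $override) or return;
    while (<$fh>) {
        next if /^\s*(#|$)/;    # skip comments and blank lines
        $conf{$1} = $2 if /^\s*(\S+)\s*[:=]\s*(.*\S)/;
    }
    close($fh);
}

# In the main loop: if ($reload) { apply_override(); $reload = 0; }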
It would be nice if endit could be packaged, since that would make endit updates easier at a local site. The RPM installation path may differ, but if an RPM build script (or template) were provided, a local site admin could alter the script (e.g. changing the installation path) to build a customised RPM for the local site.
7a3d50a and 5c5596c add generation of Prometheus style *.prom files.
README needs to be updated with this, and the tmpfiles.d example from the runtime config change issue modified to match. It's likely counter-productive to document the metrics while they are a somewhat moving target, until we narrow down exactly what's useful.
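For reference, a tmpfiles.d entry for such a runtime directory could look like this (the path and ownership are assumptions, not taken from the repository):

# /etc/tmpfiles.d/endit.conf - runtime dir for *.prom files and overrides
d /run/endit 0755 endit endit -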
When running multiple read/write pools on the same machine it would be nice to be able to have just one installed version of endit with multiple config files. With the current version you need to install endit several times in different locations.
Prepare for centralized logging by adding short/long descriptions to the config; see the example below.
Short: printed in every log message, typical value the same as the dCache hsminstance.
Long: printed on startup (or daily?), might be used for descriptive text or for passing metadata to central logs (i.e. hsminstance=ops_tape_read tapesize=15T tapesallocated=700 tapespeedmbps=400 or similar).
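In endit.conf this could look something like the following (key names are hypothetical):

# Hypothetical endit.conf additions
desc-short: ops_tape_read
desc-long: hsminstance=ops_tape_read tapesize=15T tapesallocated=700 tapespeedmbps=400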
Older ENDIT versions seem to have been prone to archiving files multiple times when certain error conditions were triggered.
This should be due to old bugs, but we need to document how to detect whether this has happened and, most importantly, how to clean up afterwards.
Right now, if dCache flushes a file multiple times it will get multiple copies in the TSM archive, which leads to wasted space and slower retrieves.
Procedure is something like: detect the duplicate archive copies, then delete all but one copy of each affected file.
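As a starting point, the detection step could be sketched like this (names are illustrative, and the path regex is deliberately crude):

# Count how many times each pathname appears in dsmc query archive output.
my %seen;
open(my $q, '-|', 'dsmc', 'query', 'archive', "$dir/*") or die "dsmc: $!";
while (<$q>) {
    $seen{$1}++ if m{\s(/\S+)\s};    # crude: first absolute path on the line
}
close($q);
for my $f (sort keys %seen) {
    print "$f archived $seen{$f} times\n" if $seen{$f} > 1;
}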
If the ENDIT daemons are started in a directory that is subsequently deleted, store/retrieve operations will fail with errors similar to:
dsmc retrieve failure volume default file list /tapecache/requestlists/default.OQ_knf: child exited with value 8
STDERR: shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
The obvious workaround is to do chdir / on startup like proper daemons do.
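For example, early in daemon startup (standard daemon practice, not anything ENDIT-specific):

# Never hold a reference to a working directory that may be deleted.
chdir('/') or die "chdir /: $!";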
At a quick glance, the use of IPC::Run3 is motivated by answering A to a replace-question in tsmretriever.pl.
Investigate whether we should use -replace=All and/or -ifnewer instead.
Currently all retries of dsmc operations are done after a fixed retry period.
In cases where big operations fail (e.g. due to server malfunction, broken tape, etc.) this can result in logs filling up when using the default retry delay of 60 seconds.
Investigate whether we can do exponential backoff on "large" failures; these should be identifiable either from dsmc return codes/messages or simply by noticing that we are retrying the same files over and over. See the sketch below.
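A sketch of the backoff itself, doubling the delay on consecutive failures up to a cap (run_dsmc_operation is a hypothetical stand-in for the real retry site):

my $delay     = 60;      # default retry delay, seconds
my $max_delay = 3600;    # cap the backoff at one hour

while (1) {
    if (run_dsmc_operation()) {    # hypothetical: true on success
        $delay = 60;               # reset after a successful run
    } else {
        warn "operation failed, retrying in $delay seconds\n";
        sleep $delay;
        $delay = $delay * 2 > $max_delay ? $max_delay : $delay * 2;
    }
}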
Tape usage statistics will need some centralized logging of data to generate https://twiki.cern.ch/twiki/bin/view/HEPTape/TapeMetricsJSON
Things to include: hsminstance, total number of files, number of tapes
Each tape mount with: time, number of files, number of bytes
Errors (i.e. all local logging should also be sent remotely), for OoD enjoyment.
Idea: see if we can use dCache's log4j central logging, so we only have to implement a sender, not a collector.
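As an illustration only, a record built from the fields listed above might be structured like this before serialization (this is not the actual TapeMetricsJSON schema; values are placeholders):

my %report = (
    hsminstance => 'ops_tape_read',
    total_files => 12_345_678,    # placeholder
    tapes       => 700,           # placeholder
    mounts      => [
        # one entry per tape mount: time, number of files, number of bytes
        { time => '2024-01-01T00:00:00Z', files => 152, bytes => 73_014_444_032 },
    ],
);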
In the beginning of time, ENDIT archived files with a description of the form 'endit YYYY-MM'. This was, however, changed to just 'endit' due to performance limitations of the old TSM (v5 and earlier) server database.
Since v6, TSM uses DB2, so the old issue is moot and we should revert to the way we were doing this previously.
Having a somewhat unique description helps when needing to delete duplicates, and in general helps to quickly assess file age when viewing dsmc q archive output.
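Generating the description is a one-liner (POSIX::strftime is core Perl):

use POSIX qw(strftime);
my $description = strftime('endit %Y-%m', localtime);
# e.g. passed to dsmc as -description="$description"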
Currently we honor archiver_timeout also for retries when the amount is below archiver_threshold1_usage.
However, if set to a large value this might cause stores to fail, due to the retry happening so late that the store times out on the dCache level.
Suggestion: introduce a retry timeout (say 1 hour by default) and use the lowest of archiver_timeout and the retry timeout when retrying stores.
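The selection itself is trivial; a sketch with retrytimeout as a hypothetical new setting defaulting to one hour:

use List::Util qw(min);
my $effective_timeout = min($conf{archiver_timeout},
                            $conf{retrytimeout} // 3600);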