theroddywms / batcheuphoria
A library to access different kinds of cluster backends
License: MIT License
It would be quite sensible to share more code between JobManagers, especially those using command line tools, since they currently duplicate a lot of code.
This would be very useful for preventing the duplication of bugs that were already fixed, such as #52.
It should actually query the cluster and return status information about all known jobs.
If more jobs are submitted than allowed for the user, Roddy dies with a stack trace. However, this is a BE problem: BE should throw a proper exception, which could then be caught by clients.
A workflow error occurred, try to rollback / abort submitted jobs.
bkill 14843 14844 14848 14849 14850 14852 14856
An unknown / unhandled exception occurred: 'Could not parse raw ID from: 'Group : Pending job threshold reached. Retrying in 60 seconds...''
de.dkfz.roddy.execution.jobs.cluster.lsf.LSFJobManager.parseJobID(LSFJobManager.groovy:611)
de.dkfz.roddy.execution.jobs.BatchEuphoriaJobManager.extractAndSetJobResultFromExecutionResult(BatchEuphoriaJobManager.groovy:143)
de.dkfz.roddy.execution.jobs.cluster.lsf.LSFJobManager.runJob(LSFJobManager.groovy:148)
de.dkfz.roddy.execution.jobs.cluster.lsf.LSFJobManager$runJob.call(Unknown Source)
de.dkfz.roddy.execution.jobs.Job.run(Job.groovy:534)
de.dkfz.roddy.knowledge.methods.GenericMethod.createAndRunSingleJob(GenericMethod.groovy:507)
de.dkfz.roddy.knowledge.methods.GenericMethod._callGenericToolOrToolArray(GenericMethod.groovy:255)
de.dkfz.roddy.knowledge.methods.GenericMethod.callGenericTool(GenericMethod.groovy:50)
de.dkfz.b080.co.files.CoverageTextFile.plot(CoverageTextFile.java:49)
de.dkfz.b080.co.files.CoverageTextFileGroup.plot(CoverageTextFileGroup.java:47)
de.dkfz.b080.co.qcworkflow.QCPipeline.execute(QCPipeline.groovy:90)
de.dkfz.roddy.core.ExecutionContext.execute(ExecutionContext.groovy:625)
de.dkfz.roddy.core.Analysis.executeRun(Analysis.java:397)
de.dkfz.roddy.core.Analysis.executeRun(Analysis.java:341)
de.dkfz.roddy.core.Analysis.rerun(Analysis.java:229)
de.dkfz.roddy.client.cliclient.RoddyCLIClient.rerun(RoddyCLIClient.groovy:513)
de.dkfz.roddy.client.cliclient.RoddyCLIClient.parseStartupMode(RoddyCLIClient.groovy:116)
de.dkfz.roddy.Roddy.parseRoddyStartupModeAndRun(Roddy.java:721)
de.dkfz.roddy.Roddy.startup(Roddy.java:289)
de.dkfz.roddy.Roddy.main(Roddy.java:216)
The reason here is an interaction of Roddy and BE. It seems BE searches for a specific line in the output that is not found, and instead returns the wait notice above. When something like this happens on the command line during manual submission, bsub blocks.
When the execution of a command (bsub, bjobs, etc.) fails and BE throws an exception, it would be nice if the exception message contained the stderr of that command.
To implement this, a method in ExecutionResult needs to be extended.
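A minimal sketch of how such an exception could carry stderr. The class name, constructor, and getter here are hypothetical; BE would presumably extend its own BEException instead:

```java
// Hedged sketch: an exception type whose message includes the failed
// command's stderr. The class name, constructor and getter are
// hypothetical, not BE's actual API.
public class CommandFailedException extends RuntimeException {
    private final String stderr;

    public CommandFailedException(String message, String stderr) {
        super(stderr == null || stderr.isEmpty()
                ? message
                : message + "\nstderr: " + stderr);
        this.stderr = stderr;
    }

    public String getStderr() {
        return stderr;
    }
}
```

Clients could then log the stderr separately or match on it when handling submission failures.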
When I pass environment variables to the constructor of BEJob, I expect them to be passed to qsub with the -v parameter; instead, -v PARAMETER_FILE=null is used.
A builder for BEJob would be nice; the constructor is a bit unclear with its many parameters, not all of which are needed.
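A minimal sketch of what such a builder could look like. JobSpec and its field names are illustrative placeholders, not BE's actual BEJob API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hedged sketch of a builder; JobSpec and its fields are illustrative
// stand-ins, not BE's actual BEJob constructor parameters.
public class JobSpec {
    final String jobName;
    final String tool;
    final Map<String, String> parameters;

    private JobSpec(Builder b) {
        this.jobName = b.jobName;
        this.tool = b.tool;
        this.parameters = b.parameters;
    }

    public static class Builder {
        private String jobName;
        private String tool;
        private final Map<String, String> parameters = new LinkedHashMap<>();

        public Builder jobName(String name) { this.jobName = name; return this; }
        public Builder tool(String tool) { this.tool = tool; return this; }
        public Builder parameter(String key, String value) {
            parameters.put(key, value);
            return this;
        }

        public JobSpec build() {
            // Only the genuinely required fields are validated here.
            if (jobName == null) throw new IllegalStateException("jobName is required");
            return new JobSpec(this);
        }
    }
}
```

Optional parameters then simply stay unset instead of forcing callers to pass null into a long constructor.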
LSF, for example, will just block and retry command submission; PBS does not (or at least our installation doesn't). Or it depends on the configuration, which can lead to odd behaviour. In any case, the developer should be aware of this fact, and a TimeoutException might be just the right thing. Currently BE only offers the base class or interface, but this could then be used to force the exception.
In the method PbsJobManager.runJob, if a job fails, runResult is not set, but runResult.wasExecuted is called, which results in an NPE.
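A minimal sketch of the defensive fix, with RunResult as a stand-in for BE's actual result type:

```java
// Hedged sketch: treat a missing runResult as "not executed" instead
// of dereferencing null. RunResult stands in for BE's result type.
class RunResult {
    private final boolean executed;
    RunResult(boolean executed) { this.executed = executed; }
    boolean wasExecuted() { return executed; }
}

class JobRunner {
    static boolean wasExecuted(RunResult runResult) {
        // Guard: a failed submission may never set runResult.
        return runResult != null && runResult.wasExecuted();
    }
}
```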
This is especially problematic because LsfJobManager calls setJobInfoForJobDetails every time job states are updated, not only when statistics are requested.
Command executed by BE:
bjobs -noheader -a -o "jobid job_name stat user queue job_description proj_name job_group job_priority pids exit_code from_host exec_host submit_time start_time finish_time cpu_used run_time user_group swap max_mem runtimelimit sub_cwd pend_reason exec_cwd output_file input_file effective_resreq exec_home slots delimiter='<'" -u otptest
Output of bjobs:
5935<otp_workflow_test_pid_14_DataInstallationWorkflow_1_CopyFilesJob<DONE<otptest<medium-dmg<-<default<-<-<-<-<tbi-cn010<tbi-cn010<Sep 29 15:06<Sep 29 15:06<Sep 29 15:06 L<0.1 second(s)<0 second(s)<-<-<-<20.0/tbi-cn010<$HOME<-</home/otptest</ibios/dmdc/otp/workflow-tests/tmp/DataInstallationWorkflow-klinga-2017-09-29-15-05-07-297+0200-9AFOIiRNYiMw66qP/logging_root_path/clusterLog/2017-09-29/otp_workflow_test_pid_14_DataInstallationWorkflow_1_CopyFilesJob.o5935<-<select[type == any ] order[r15s:pg] </home/otptest<1
5934<otp_workflow_test_pid_14_DataInstallationWorkflow_1_CopyFilesJob<DONE<otptest<medium-dmg<-<default<-<-<-<-<tbi-cn010<tbi-cn010<Sep 29 15:06<Sep 29 15:06<Sep 29 15:06 L<0.1 second(s)<0 second(s)<-<-<-<20.0/tbi-cn010<$HOME<-</home/otptest</ibios/dmdc/otp/workflow-tests/tmp/DataInstallationWorkflow-klinga-2017-09-29-15-05-07-297+0200-9AFOIiRNYiMw66qP/logging_root_path/clusterLog/2017-09-29/otp_workflow_test_pid_14_DataInstallationWorkflow_1_CopyFilesJob.o5934<-<select[type == any ] order[r15s:pg] </home/otptest<1
Exception:
Duration string is not of the format HH+:MM:SS: '0.1 second(s)'
Error | de.dkfz.roddy.BEException: Duration string is not of the format HH+:MM:SS: '0.1 second(s)'
Error | at de.dkfz.roddy.execution.jobs.cluster.ClusterJobManager.parseColonSeparatedHHMMSSDuration(ClusterJobManager.groovy:104)
Error | at de.dkfz.roddy.execution.jobs.cluster.lsf.LSFJobManager.setJobInfoForJobDetails(LSFJobManager.groovy:475)
Error | at de.dkfz.roddy.execution.jobs.cluster.lsf.LSFJobManager.updateJobStatus(LSFJobManager.groovy:373)
Error | at de.dkfz.roddy.execution.jobs.cluster.lsf.LSFJobManager.queryJobStatusAll(LSFJobManager.groovy:82)
Error | at de.dkfz.roddy.execution.jobs.cluster.lsf.LSFJobManager$queryJobStatusAll$2.call(Unknown Source)
Error | at de.dkfz.tbi.otp.job.processing.ClusterJobSchedulerService.retrieveKnownJobsWithState(ClusterJobSchedulerService.groovy:163)
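A hedged sketch of a parser that accepts both the colon-separated format and the textual "0.1 second(s)" form seen in this output. The fallback branch is an assumption about one possible way to handle the second format, not BE's actual fix:

```java
import java.time.Duration;

// Hedged sketch: parse both LSF run-time formats — the colon-separated
// "HH+:MM:SS" form BE already expects, and the textual "0.1 second(s)"
// form that triggered the exception above.
class DurationParser {
    static Duration parse(String raw) {
        String s = raw.trim();
        if (s.matches("\\d+(:\\d{2}){2}")) {         // e.g. "01:02:03"
            String[] p = s.split(":");
            return Duration.ofHours(Long.parseLong(p[0]))
                    .plusMinutes(Long.parseLong(p[1]))
                    .plusSeconds(Long.parseLong(p[2]));
        }
        if (s.matches("[\\d.]+ second\\(s\\)")) {    // e.g. "0.1 second(s)"
            double seconds = Double.parseDouble(s.split(" ")[0]);
            return Duration.ofMillis(Math.round(seconds * 1000));
        }
        throw new IllegalArgumentException(
                "Duration string is not in a known format: '" + raw + "'");
    }
}
```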
It should be possible to set the host when using keys: RestExecutionService(File keystoreLocation, String keyStorePassword)
Build fails on CentOS 7 with this message:
$ gradle jar
Starting a Gradle Daemon (subsequent builds will be faster)
FAILURE: Build failed with an exception.
* Where:
Build file '/data/code/BatchEuphoria/build.gradle' line: 22
* What went wrong:
A problem occurred evaluating root project 'BatchEuphoria'.
> Command failed with code 129: '/usr/bin/env git -C /data/code/BatchEuphoria describe --tags --dirty'
Error output: 'Unknown option: -Cusage: git [--version] [--help] [-c name=value] [--exec-path[=<path>]] [--html-path] [--man-path] [--info-path] [-p|--paginate|--no-pager] [--no-replace-objects] [--bare] [--git-dir=<path>] [--work-tree=<path>] [--namespace=<name>] <command> [<args>]'
* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output.
BUILD FAILED
Available options in PBS and LSF:
PBS
-m | mail_options | Defines the set of conditions under which the execution server will send a mail message about the job. The mail_options argument is a string which consists of either the single character "n", or one or more of the characters "a", "b", and "e".
If the character "n" is specified, no normal mail is sent. Mail for job cancels and other events outside of normal job processing are still sent.
For the letters "a", "b", and "e":
a – Mail is sent when the job is aborted by the batch system.
b – Mail is sent when the job begins execution.
e – Mail is sent when the job terminates.
If the -m option is not specified, mail will be sent if the job is aborted.
-M | user_list | Declares the list of users to whom mail is sent by the execution server when it sends mail about the job.
The user_list argument is of the form:
user[@host][,user[@host],...]
If unset, the list defaults to the submitting user at the qsub host, i.e. the job owner.
LSF
-u mail_user
Sends mail to the specified email destination. To specify a Windows user account, include the domain name in uppercase letters and use a single backslash (DOMAIN_NAME\user_name) in a Windows command line or a double backslash (DOMAIN_NAME\\user_name) in a UNIX command line.
-N
Sends the job report to you by mail when the job finishes. When used without any other options, behaves the same as the default.
Use only with -o, -oo, -I, -Ip, and -Is options, which do not send mail, to force LSF to send you a mail message when the job is done.
-B
Sends mail to you when the job is dispatched and begins execution.
Set -o or -oo to disable sending mail; if no log is required, set -o /dev/null.
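The mapping from scheduler-agnostic notification events to the PBS -m string documented above could be sketched like this. MailEvent and the method name are illustrative, not BE's API:

```java
import java.util.Set;

// Hypothetical scheduler-agnostic notification events.
enum MailEvent { ABORT, BEGIN, END }

class PbsMailOptions {
    // Build the argument for PBS "-m" from a set of events:
    // "a" = aborted, "b" = begins execution, "e" = terminates,
    // "n" = no normal mail (per the qsub documentation above).
    static String toPbsFlag(Set<MailEvent> events) {
        if (events.isEmpty()) return "n";
        StringBuilder sb = new StringBuilder();
        if (events.contains(MailEvent.ABORT)) sb.append('a');
        if (events.contains(MailEvent.BEGIN)) sb.append('b');
        if (events.contains(MailEvent.END)) sb.append('e');
        return sb.toString();
    }
}
```

An analogous converter for LSF would emit -B for BEGIN and -N for END instead of a combined flag string.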
Introduce getters for "*_JOBID" and similar scheduler specific environment variables in ClusterJobManager interface and implementations.
For OTP we only need JOBID at the moment, but other variables might also be useful for others. A list can be found here under Environment.
PBS_SCRATCH_DIR (used in PBSJobManager) is not a PBS specific variable, but specific to our cluster and maybe shouldn't be in BE.
The string should be in the same format, preferably without ${}, so they can also be used in other places than bash scripts.
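A minimal sketch of such getters, returning the variable names without ${} as requested. The interface and class names are illustrative; PBS_JOBID and LSB_JOBID are the real scheduler variables:

```java
// Hedged sketch: per-scheduler getters for the job-ID environment
// variable, returned without "${}" so callers can use the name
// outside bash scripts too. Interface/class names are illustrative.
interface JobIdVariable {
    String getJobIdVariable();
}

class PbsJobIdVariable implements JobIdVariable {
    public String getJobIdVariable() { return "PBS_JOBID"; }
}

class LsfJobIdVariable implements JobIdVariable {
    public String getJobIdVariable() { return "LSB_JOBID"; }
}
```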
It would be great if BE had a method to query the job states for all jobs of a user.
I noticed that it already updates all job states internally, and for OTP it would be useful to expose that information.
In LsfJobManager.runJob(BEJob job), the cacheLock is locked but never unlocked.
This means all subsequent operations wait forever.
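The standard fix is to release the lock in a finally block so a failure inside the critical section cannot leave cacheLock held forever. A minimal self-contained sketch (CacheUser is a stand-in, not BE's actual class):

```java
import java.util.concurrent.locks.ReentrantLock;

// Hedged sketch of the lock/try/finally pattern; CacheUser and its
// cache stand in for the real job-cache code in LsfJobManager.runJob.
class CacheUser {
    private final ReentrantLock cacheLock = new ReentrantLock();
    private int cachedJobs = 0;

    void runJob() {
        cacheLock.lock();
        try {
            cachedJobs++;          // stand-in for the real cache update
        } finally {
            cacheLock.unlock();    // always released, even on exception
        }
    }

    boolean isLocked() { return cacheLock.isLocked(); }
    int jobs() { return cachedJobs; }
}
```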
There should be an option per job to set the used shell to interpret the job.
Comment from @dankwart-de:
Please also unify the convertResources method across the classes. There are code duplicates and missing functionality, e.g. in the GE JobManager. However, we should also make sure that we convert the resource set entries to the right target values.
It would be nice if you could set the account name (parameter -A in PBS) per job as well, and not only per job manager.
If job submission with jobManager.runJob(job) fails, it would be useful if there was more information about why it failed.
Maybe add an additional field to BEJobResult.
Before commit af9524d, information about used swap was set in setJobInfoForJobDetails(), but that seemed to get lost in that commit.
When I call one of the queryJobStatus* methods in PBSJobManager and get back an empty map, I don't know whether no jobs are running anymore or whether updateJobStatus/qstat failed.
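One way to make the two cases distinguishable is to wrap the map in a small result type. A hedged sketch (names are illustrative, not BE's API):

```java
import java.util.Collections;
import java.util.Map;

// Hedged sketch: distinguish "query succeeded, no jobs" from
// "qstat/updateJobStatus failed". Names are illustrative.
class JobStatusResult {
    final boolean queriedSuccessfully;
    final Map<String, String> states;   // job ID -> state

    private JobStatusResult(boolean ok, Map<String, String> states) {
        this.queriedSuccessfully = ok;
        this.states = states;
    }

    static JobStatusResult success(Map<String, String> states) {
        return new JobStatusResult(true, states);
    }

    static JobStatusResult failure() {
        return new JobStatusResult(false, Collections.emptyMap());
    }
}
```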
with this exception:
java.lang.NullPointerException
at java.time.Duration.between(Duration.java:473)
at de.dkfz.roddy.execution.RestExecutionService.execute(RestExecutionService.groovy:171)
at de.dkfz.roddy.execution.RestExecutionService.logon(RestExecutionService.groovy:111)
at de.dkfz.roddy.execution.RestExecutionService.<init>(RestExecutionService.groovy:92)
at de.dkfz.tbi.otp.job.processing.ClusterJobManagerFactoryService.getJobManager(ClusterJobManagerFactoryService.groovy:36)
…
There are only two (!) unit tests for PBS cluster operations, but every method in every JobManager should be tested.
Is it necessary?
In PBS, if you pass an output path:
You need to pass -joe to combine stdout and stderr
To write nothing, you need to use the -k option
The variable "$PBS_JOBID" in the path will be replaced by the job ID
In LSF, if you pass an output path:
Stderr is written to the same file as stdout if you don't pass -eo/-e; it seems that also happens if you pass the same directory to both -oo and -eo.
%J in the path will be replaced by the job ID, and %I by the array index
BE should make sure that if you use the same options, the same files are written on all systems.
This exception occurred today in OTP:
2017-09-15 09:34:35,717 [taskScheduler-8] INFO scheduler.ClusterJobMonitoringService - 15227597 finished on Realm Realm 28110774 DKFZ_13.1 DATA_MANAGEMENT production
2017-09-15 09:34:35,972 [taskScheduler-8] WARN scheduler.ClusterJobMonitoringService - Failed to fill in runtime statistics for Cluster job 15227597 on Realm 28110774 DKFZ_13.1 DATA_MANAGEMENT production with user otp
java.time.format.DateTimeParseException: Text cannot be parsed to a Duration
at java.time.Duration.parse(Duration.java:412)
at de.dkfz.roddy.execution.jobs.cluster.pbs.PBSJobManager$_processQstatOutput_closure15.doCall(PBSJobManager.groovy:776)
at de.dkfz.roddy.execution.jobs.cluster.pbs.PBSJobManager.processQstatOutput(PBSJobManager.groovy:735)
at de.dkfz.roddy.execution.jobs.cluster.pbs.PBSJobManager.queryExtendedJobStateById(PBSJobManager.groovy:591)
at de.dkfz.tbi.otp.job.processing.ClusterJobSchedulerService.retrieveAndSaveJobStatistics(ClusterJobSchedulerService.groovy:153)
at de.dkfz.tbi.otp.job.scheduler.ClusterJobMonitoringService$_check_closure5$_closure9.doCall(ClusterJobMonitoringService.groovy:132)
at de.dkfz.tbi.otp.job.scheduler.ClusterJobMonitoringService$_check_closure5.doCall(ClusterJobMonitoringService.groovy:121)
at de.dkfz.tbi.otp.job.scheduler.ClusterJobMonitoringService.check(ClusterJobMonitoringService.groovy:116)
at de.dkfz.tbi.otp.job.scheduler.SchedulerService.clusterJobCheck(SchedulerService.groovy:301)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Job → BEJob
The toolScript argument of the BEJob constructor shouldn't be double escaped.
Our admin just explained to me that the default, standard behaviour for non-interactive sessions is that e.g. bsub is not in the PATH when it is called via ssh.
For our PBS cluster we had a SUSE system which was behaving in a non-standard way (as I was told). There, the .bashrc was also loaded in non-interactive sessions.
Our LSF cluster behaves in a standards-conformant way (as explained), and there we'd either need to add something to the .bashrc or .bash_profile, OR we need to call bsub/bjobs etc. with the full path.
For this case, we could introduce:
It has several similar (and apparently unused?) methods, maybe some of them could be removed.
For some methods it's not obvious how to implement them, so it would be nice if there was some documentation.
There's an infinite loop in BEJob.getLoggingDirectory; the method calls itself.
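A minimal illustration of this bug class and its fix (LoggingJob is a stand-in for BEJob):

```java
// Hedged sketch of the bug class described above: a getter that calls
// itself recurses until StackOverflowError; returning the backing
// field fixes it. LoggingJob stands in for BEJob.
class LoggingJob {
    private final String loggingDirectory;

    LoggingJob(String loggingDirectory) {
        this.loggingDirectory = loggingDirectory;
    }

    // Buggy version (do not use): infinite recursion.
    // String getLoggingDirectory() { return getLoggingDirectory(); }

    String getLoggingDirectory() {
        return loggingDirectory;   // return the field, not the getter
    }
}
```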
When using SNAPSHOT versions, the dependency is downloaded every time. This also happens for transitive dependencies, such as when compiling OTP.
The server sends the following response because the password contains an &.
2017-09-26 15:54:04,565 [pool-4-thread-1] DEBUG http.wire - http-outgoing-1 << "JAXBException occurred : The entity name must immediately follow the '&' in the entity reference.. The entity name must immediately follow the '&' in the entity reference.. "
These states are never set and maybe could be deleted:
UNKNOWN_SUBMITTED: "Recently submitted job, jobState is unknown" (submitted jobs are actually set to HOLD, QUEUED or FAILED)
UNKNOWN_READOUT, UNSTARTED, FAILED_POSSIBLE
For these states the documentation could be improved:
FAILED: "BEJob has failed": this is only used when an error occurred while submitting the job
OK: "BEJob was ok": this seems to be used by array head jobs only? It is not exactly clear what "ok" means (maybe use COMPLETED_SUCCESSFULLY if the cluster scheduler can return whether a job was successful?)
ABORTED
It just returns BEJob.id.
This causes problems in OTP because it expects the short ID (number only).
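A hedged sketch of extracting the numeric short ID from a raw scheduler ID; the example raw formats are assumptions based on typical PBS/LSF output:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hedged sketch: extract the leading numeric part of a scheduler job
// ID (e.g. "5935.pbs-server" -> "5935"). The raw formats shown are
// assumptions about typical PBS/LSF output, not BE's actual inputs.
class JobIdParser {
    private static final Pattern LEADING_DIGITS = Pattern.compile("^(\\d+)");

    static String shortId(String rawId) {
        Matcher m = LEADING_DIGITS.matcher(rawId.trim());
        if (!m.find())
            throw new IllegalArgumentException("No numeric ID in: '" + rawId + "'");
        return m.group(1);
    }
}
```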
Output when executing:
$ bjobs -noheader -o \"jobid job_name stat user queue job_description proj_name job_group job_priority pids exit_code from_host exec_host submit_time start_time finish_time cpu_used run_time user_group swap max_mem runtimelimit sub_cwd pend_reason exec_cwd output_file input_file effective_resreq exec_home slots delimiter=\'\<\'\" -u otptest
job_name: Illegal job ID.
The ' shouldn't be escaped, and the " only once.
On some systems, the cluster tools such as qsub, bsub, sbatch etc, might be in a directory which is not in the PATH
variable. It might be useful to have an optional configuration parameter to pass this directory to BE.
In PbsJobManager.runJob
, a lock (cacheLock
) is acquired but is never released, which will cause the methods using it to wait forever if they're called from different threads.
bjobs is called without the option -a, which is required to also show finished jobs.
From the manpage:
-a
Displays information about jobs in all states, including finished jobs that finished recently, within an interval specified by CLEAN_PERIOD in lsb.params (the default period is 1 hour).
Use -a with -x option to display all jobs that have triggered a job exception (overrun, underrun, idle).
When using toolString:
PbsCommand.toString results in an NPE
BEJob.toString results in an NPE
PbsJobManager.createCommand results in an NPE
(toolScript is piped into qsub, instead of being used as an argument)
In the constructor of RestResult, successful is set to true if the statusCode is 200, but the getter in ExecutionResult doesn't use successful; instead, it uses the exitCode.
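One way to keep the two success indicators from disagreeing is to derive both from a single source of truth. A hedged sketch (RestResultSketch mirrors the issue's field names, not BE's actual classes):

```java
// Hedged sketch: derive both "successful" and the exit-code-based
// getter from the same HTTP status, so they cannot contradict each
// other. RestResultSketch is illustrative, not BE's actual RestResult.
class RestResultSketch {
    final int statusCode;

    RestResultSketch(int statusCode) { this.statusCode = statusCode; }

    boolean isSuccessful() { return statusCode == 200; }

    // Map HTTP status onto the exit-code convention: 0 means success.
    int getExitCode() { return isSuccessful() ? 0 : 1; }
}
```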
When trying to run gradle build, it fails with the following output:
FAILURE: Build failed with an exception.
* Where:
Build file '/data/code/BatchEuphoria/build.gradle' line: 21
* What went wrong:
A problem occurred evaluating root project 'BatchEuphoria'.
> Command failed with code 128: '/usr/bin/env git -C /data/code/BatchEuphoria describe --tags --dirty'
Error output: 'fatal: No names found, cannot describe anything.'
de.dkfz.roddy.execution.cluster.pbs.PBSCommandParserTest > testConstructAndParseSimpleJob FAILED
org.codehaus.groovy.runtime.powerassert.PowerAssertionError at PBSCommandParserTest.groovy:49
Don't know why. The test works fine on my local machine and does not use the file or cluster system.