theroddywms / batcheuphoria
A library to access different kinds of cluster backends
License: MIT License
It would be quite sensible to share more code between JobManagers, especially those using command line tools, since they currently duplicate a lot of code.
This would be very useful for preventing the duplication of bugs that were already fixed, such as #52.
It should actually query the cluster and return status information about all known jobs.
If more jobs are submitted than allowed for the user, Roddy dies with a stack trace. However, this is a BE problem: BE should throw a proper exception, which could then be caught by clients.
A workflow error occurred, try to rollback / abort submitted jobs.
bkill 14843 14844 14848 14849 14850 14852 14856
An unknown / unhandled exception occurred: 'Could not parse raw ID from: 'Group : Pending job threshold reached. Retrying in 60 seconds...''
de.dkfz.roddy.execution.jobs.cluster.lsf.LSFJobManager.parseJobID(LSFJobManager.groovy:611)
de.dkfz.roddy.execution.jobs.BatchEuphoriaJobManager.extractAndSetJobResultFromExecutionResult(BatchEuphoriaJobManager.groovy:143)
de.dkfz.roddy.execution.jobs.cluster.lsf.LSFJobManager.runJob(LSFJobManager.groovy:148)
de.dkfz.roddy.execution.jobs.cluster.lsf.LSFJobManager$runJob.call(Unknown Source)
de.dkfz.roddy.execution.jobs.Job.run(Job.groovy:534)
de.dkfz.roddy.knowledge.methods.GenericMethod.createAndRunSingleJob(GenericMethod.groovy:507)
de.dkfz.roddy.knowledge.methods.GenericMethod._callGenericToolOrToolArray(GenericMethod.groovy:255)
de.dkfz.roddy.knowledge.methods.GenericMethod.callGenericTool(GenericMethod.groovy:50)
de.dkfz.b080.co.files.CoverageTextFile.plot(CoverageTextFile.java:49)
de.dkfz.b080.co.files.CoverageTextFileGroup.plot(CoverageTextFileGroup.java:47)
de.dkfz.b080.co.qcworkflow.QCPipeline.execute(QCPipeline.groovy:90)
de.dkfz.roddy.core.ExecutionContext.execute(ExecutionContext.groovy:625)
de.dkfz.roddy.core.Analysis.executeRun(Analysis.java:397)
de.dkfz.roddy.core.Analysis.executeRun(Analysis.java:341)
de.dkfz.roddy.core.Analysis.rerun(Analysis.java:229)
de.dkfz.roddy.client.cliclient.RoddyCLIClient.rerun(RoddyCLIClient.groovy:513)
de.dkfz.roddy.client.cliclient.RoddyCLIClient.parseStartupMode(RoddyCLIClient.groovy:116)
de.dkfz.roddy.Roddy.parseRoddyStartupModeAndRun(Roddy.java:721)
de.dkfz.roddy.Roddy.startup(Roddy.java:289)
de.dkfz.roddy.Roddy.main(Roddy.java:216)
The reason here is an interaction of Roddy and BE. It seems BE searches for a specific line in the output that is not found, and instead returns the wait notice above. When something like this happens on the command line during manual submission, bsub blocks.
When the execution of a command (bsub, bjobs, etc.) fails and BE throws an exception, it would be nice if the exception message contained the stderr of that command.
To implement this, a method in ExecutionResult needs to be extended.
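A minimal sketch of how such an exception could carry stderr. The class name, constructor, and getter here are hypothetical; BE would presumably extend its own BEException instead:

```java
// Hedged sketch: an exception type whose message includes the failed
// command's stderr. The class name, constructor and getter are
// hypothetical, not BE's actual API.
public class CommandFailedException extends RuntimeException {
    private final String stderr;

    public CommandFailedException(String message, String stderr) {
        super(stderr == null || stderr.isEmpty()
                ? message
                : message + "\nstderr: " + stderr);
        this.stderr = stderr;
    }

    public String getStderr() {
        return stderr;
    }
}
```

Clients could then log the stderr separately or match on it when handling submission failures.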
When I pass environment variables to the constructor of BEJob, I expect them to be passed to qsub with the -v parameter; instead, -v PARAMETER_FILE=null is used.
A builder for BEJob would be nice; the constructor is a bit unclear with its many parameters, not all of which are needed.
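A minimal sketch of what such a builder could look like. JobSpec and its field names are illustrative placeholders, not BE's actual BEJob API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hedged sketch of a builder; JobSpec and its fields are illustrative
// stand-ins, not BE's actual BEJob constructor parameters.
public class JobSpec {
    final String jobName;
    final String tool;
    final Map<String, String> parameters;

    private JobSpec(Builder b) {
        this.jobName = b.jobName;
        this.tool = b.tool;
        this.parameters = b.parameters;
    }

    public static class Builder {
        private String jobName;
        private String tool;
        private final Map<String, String> parameters = new LinkedHashMap<>();

        public Builder jobName(String name) { this.jobName = name; return this; }
        public Builder tool(String tool) { this.tool = tool; return this; }
        public Builder parameter(String key, String value) {
            parameters.put(key, value);
            return this;
        }

        public JobSpec build() {
            // Only the genuinely required fields are validated here.
            if (jobName == null) throw new IllegalStateException("jobName is required");
            return new JobSpec(this);
        }
    }
}
```

Optional parameters then simply stay unset instead of forcing callers to pass null into a long constructor.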
LSF, for example, will just block and retry command submission; PBS does not (or at least our installation doesn't). Or it depends on the configuration, which can lead to odd behaviour. In any case, the developer should be aware of this fact, and a TimeoutException might be just the right thing. Currently BE only offers the base class or interface, but this could then be used to force the exception.
In the method PbsJobManager.runJob, if a job fails, runResult is not set, but runResult.wasExecuted is called, which results in an NPE.
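A minimal sketch of the defensive fix, with RunResult as a stand-in for BE's actual result type:

```java
// Hedged sketch: treat a missing runResult as "not executed" instead
// of dereferencing null. RunResult stands in for BE's result type.
class RunResult {
    private final boolean executed;
    RunResult(boolean executed) { this.executed = executed; }
    boolean wasExecuted() { return executed; }
}

class JobRunner {
    static boolean wasExecuted(RunResult runResult) {
        // Guard: a failed submission may never set runResult.
        return runResult != null && runResult.wasExecuted();
    }
}
```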
This is especially problematic because LsfJobManager calls setJobInfoForJobDetails every time job states are updated, not only when statistics are requested.
Command executed by BE:
bjobs -noheader -a -o "jobid job_name stat user queue job_description proj_name job_group job_priority pids exit_code from_host exec_host submit_time start_time finish_time cpu_used run_time user_group swap max_mem runtimelimit sub_cwd pend_reason exec_cwd output_file input_file effective_resreq exec_home slots delimiter='<'" -u otptest
Output of bjobs:
5935<otp_workflow_test_pid_14_DataInstallationWorkflow_1_CopyFilesJob<DONE<otptest<medium-dmg<-<default<-<-<-<-<tbi-cn010<tbi-cn010<Sep 29 15:06<Sep 29 15:06<Sep 29 15:06 L<0.1 second(s)<0 second(s)<-<-<-<20.0/tbi-cn010<$HOME<-</home/otptest</ibios/dmdc/otp/workflow-tests/tmp/DataInstallationWorkflow-klinga-2017-09-29-15-05-07-297+0200-9AFOIiRNYiMw66qP/logging_root_path/clusterLog/2017-09-29/otp_workflow_test_pid_14_DataInstallationWorkflow_1_CopyFilesJob.o5935<-<select[type == any ] order[r15s:pg] </home/otptest<1
5934<otp_workflow_test_pid_14_DataInstallationWorkflow_1_CopyFilesJob<DONE<otptest<medium-dmg<-<default<-<-<-<-<tbi-cn010<tbi-cn010<Sep 29 15:06<Sep 29 15:06<Sep 29 15:06 L<0.1 second(s)<0 second(s)<-<-<-<20.0/tbi-cn010<$HOME<-</home/otptest</ibios/dmdc/otp/workflow-tests/tmp/DataInstallationWorkflow-klinga-2017-09-29-15-05-07-297+0200-9AFOIiRNYiMw66qP/logging_root_path/clusterLog/2017-09-29/otp_workflow_test_pid_14_DataInstallationWorkflow_1_CopyFilesJob.o5934<-<select[type == any ] order[r15s:pg] </home/otptest<1
Exception:
Duration string is not of the format HH+:MM:SS: '0.1 second(s)'
Error | de.dkfz.roddy.BEException: Duration string is not of the format HH+:MM:SS: '0.1 second(s)'
Error | at de.dkfz.roddy.execution.jobs.cluster.ClusterJobManager.parseColonSeparatedHHMMSSDuration(ClusterJobManager.groovy:104)
Error | at de.dkfz.roddy.execution.jobs.cluster.lsf.LSFJobManager.setJobInfoForJobDetails(LSFJobManager.groovy:475)
Error | at de.dkfz.roddy.execution.jobs.cluster.lsf.LSFJobManager.updateJobStatus(LSFJobManager.groovy:373)
Error | at de.dkfz.roddy.execution.jobs.cluster.lsf.LSFJobManager.queryJobStatusAll(LSFJobManager.groovy:82)
Error | at de.dkfz.roddy.execution.jobs.cluster.lsf.LSFJobManager$queryJobStatusAll$2.call(Unknown Source)
Error | at de.dkfz.tbi.otp.job.processing.ClusterJobSchedulerService.retrieveKnownJobsWithState(ClusterJobSchedulerService.groovy:163)
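A hedged sketch of a parser that accepts both the colon-separated format and the textual "0.1 second(s)" form seen in this output. The fallback branch is an assumption about one possible way to handle the second format, not BE's actual fix:

```java
import java.time.Duration;

// Hedged sketch: parse both LSF run-time formats — the colon-separated
// "HH+:MM:SS" form BE already expects, and the textual "0.1 second(s)"
// form that triggered the exception above.
class DurationParser {
    static Duration parse(String raw) {
        String s = raw.trim();
        if (s.matches("\\d+(:\\d{2}){2}")) {         // e.g. "01:02:03"
            String[] p = s.split(":");
            return Duration.ofHours(Long.parseLong(p[0]))
                    .plusMinutes(Long.parseLong(p[1]))
                    .plusSeconds(Long.parseLong(p[2]));
        }
        if (s.matches("[\\d.]+ second\\(s\\)")) {    // e.g. "0.1 second(s)"
            double seconds = Double.parseDouble(s.split(" ")[0]);
            return Duration.ofMillis(Math.round(seconds * 1000));
        }
        throw new IllegalArgumentException(
                "Duration string is not in a known format: '" + raw + "'");
    }
}
```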
It should be possible to set the host when using keys: RestExecutionService(File keystoreLocation, String keyStorePassword)
Build fails on CentOS 7 with this message:
$ gradle jar
Starting a Gradle Daemon (subsequent builds will be faster)
FAILURE: Build failed with an exception.
* Where:
Build file '/data/code/BatchEuphoria/build.gradle' line: 22
* What went wrong:
A problem occurred evaluating root project 'BatchEuphoria'.
> Command failed with code 129: '/usr/bin/env git -C /data/code/BatchEuphoria describe --tags --dirty'
Error output: 'Unknown option: -Cusage: git [--version] [--help] [-c name=value] [--exec-path[=<path>]] [--html-path] [--man-path] [--info-path] [-p|--paginate|--no-pager] [--no-replace-objects] [--bare] [--git-dir=<path>] [--work-tree=<path>] [--namespace=<name>] <command> [<args>]'
* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output.
BUILD FAILED
Available options in PBS and LSF:
PBS
-m | mail_options | Defines the set of conditions under which the execution server will send a mail message about the job. The mail_options argument is a string which consists of either the single character "n", or one or more of the characters "a", "b", and "e".
If the character "n" is specified, no normal mail is sent. Mail for job cancels and other events outside of normal job processing are still sent.
For the letters "a", "b", and "e":
a – Mail is sent when the job is aborted by the batch system.
b – Mail is sent when the job begins execution.
e – Mail is sent when the job terminates.
If the -m option is not specified, mail will be sent if the job is aborted.
-M | user_list | Declares the list of users to whom mail is sent by the execution server when it sends mail about the job.
The user_list argument is of the form:
user[@host][,user[@host],...]
If unset, the list defaults to the submitting user at the qsub host, i.e. the job owner.
LSF
-u mail_user
Sends mail to the specified email destination. To specify a Windows user account, include the domain name in uppercase letters and use a single backslash (DOMAIN_NAME\user_name) in a Windows command line or a double backslash (DOMAIN_NAME\\user_name) in a UNIX command line.
-N
Sends the job report to you by mail when the job finishes. When used without any other options, behaves the same as the default.
Use only with -o, -oo, -I, -Ip, and -Is options, which do not send mail, to force LSF to send you a mail message when the job is done.
-B
Sends mail to you when the job is dispatched and begins execution.
Set -o or -oo to disable sending mail; if no log is required, set -o /dev/null.
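The mapping from scheduler-agnostic notification events to the PBS -m string documented above could be sketched like this. MailEvent and the method name are illustrative, not BE's API:

```java
import java.util.Set;

// Hypothetical scheduler-agnostic notification events.
enum MailEvent { ABORT, BEGIN, END }

class PbsMailOptions {
    // Build the argument for PBS "-m" from a set of events:
    // "a" = aborted, "b" = begins execution, "e" = terminates,
    // "n" = no normal mail (per the qsub documentation above).
    static String toPbsFlag(Set<MailEvent> events) {
        if (events.isEmpty()) return "n";
        StringBuilder sb = new StringBuilder();
        if (events.contains(MailEvent.ABORT)) sb.append('a');
        if (events.contains(MailEvent.BEGIN)) sb.append('b');
        if (events.contains(MailEvent.END)) sb.append('e');
        return sb.toString();
    }
}
```

An analogous converter for LSF would emit -B for BEGIN and -N for END instead of a combined flag string.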
Introduce getters for "*_JOBID" and similar scheduler specific environment variables in ClusterJobManager interface and implementations.
For OTP we only need JOBID at the moment, but other variables might also be useful for others. A list can be found here under Environment.
PBS_SCRATCH_DIR (used in PBSJobManager) is not a PBS specific variable, but specific to our cluster and maybe shouldn't be in BE.
The string should be in the same format, preferably without ${}, so they can also be used in other places than bash scripts.
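A minimal sketch of such getters, returning the variable names without ${} as requested. The interface and class names are illustrative; PBS_JOBID and LSB_JOBID are the real scheduler variables:

```java
// Hedged sketch: per-scheduler getters for the job-ID environment
// variable, returned without "${}" so callers can use the name
// outside bash scripts too. Interface/class names are illustrative.
interface JobIdVariable {
    String getJobIdVariable();
}

class PbsJobIdVariable implements JobIdVariable {
    public String getJobIdVariable() { return "PBS_JOBID"; }
}

class LsfJobIdVariable implements JobIdVariable {
    public String getJobIdVariable() { return "LSB_JOBID"; }
}
```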
It would be great if BE had a method to query the job states for all jobs of a user.
I noticed that it already updates all job states internally, and for OTP it would be useful to expose that information.
In LsfJobManager.runJob(BEJob job), the cacheLock is locked but never unlocked.
This means all subsequent operations wait forever.
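The standard fix is to release the lock in a finally block so a failure inside the critical section cannot leave cacheLock held forever. A minimal self-contained sketch (CacheUser is a stand-in, not BE's actual class):

```java
import java.util.concurrent.locks.ReentrantLock;

// Hedged sketch of the lock/try/finally pattern; CacheUser and its
// cache stand in for the real job-cache code in LsfJobManager.runJob.
class CacheUser {
    private final ReentrantLock cacheLock = new ReentrantLock();
    private int cachedJobs = 0;

    void runJob() {
        cacheLock.lock();
        try {
            cachedJobs++;          // stand-in for the real cache update
        } finally {
            cacheLock.unlock();    // always released, even on exception
        }
    }

    boolean isLocked() { return cacheLock.isLocked(); }
    int jobs() { return cachedJobs; }
}
```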
There should be an option per job to set the used shell to interpret the job.
Comment from @dankwart-de:
Please also unify the convertResources method across the classes. There are code duplicates and missing functionality, e.g. in the GE JobManager. However, we should also make sure that we convert the resource set entries to the right target values.
It would be nice if you could set the account name (parameter -A in PBS) per job as well, and not only per job manager.
If job submission with jobManager.runJob(job) fails, it would be useful if there was more information about why it failed.
Maybe add an additional field to BEJobResult.
Before commit af9524d, information about used swap was set in setJobInfoForJobDetails(), but that seemed to get lost in that commit.
When I call one of the queryJobStatus* methods in PBSJobManager and get back an empty map, I don't know whether no jobs are running anymore or whether updateJobStatus/qstat failed.
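One way to make the two cases distinguishable is to wrap the map in a small result type. A hedged sketch (names are illustrative, not BE's API):

```java
import java.util.Collections;
import java.util.Map;

// Hedged sketch: distinguish "query succeeded, no jobs" from
// "qstat/updateJobStatus failed". Names are illustrative.
class JobStatusResult {
    final boolean queriedSuccessfully;
    final Map<String, String> states;   // job ID -> state

    private JobStatusResult(boolean ok, Map<String, String> states) {
        this.queriedSuccessfully = ok;
        this.states = states;
    }

    static JobStatusResult success(Map<String, String> states) {
        return new JobStatusResult(true, states);
    }

    static JobStatusResult failure() {
        return new JobStatusResult(false, Collections.emptyMap());
    }
}
```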
with this exception:
java.lang.NullPointerException
at java.time.Duration.between(Duration.java:473)
at de.dkfz.roddy.execution.RestExecutionService.execute(RestExecutionService.groovy:171)
at de.dkfz.roddy.execution.RestExecutionService.logon(RestExecutionService.groovy:111)
at de.dkfz.roddy.execution.RestExecutionService.<init>(RestExecutionService.groovy:92)
at de.dkfz.tbi.otp.job.processing.ClusterJobManagerFactoryService.getJobManager(ClusterJobManagerFactoryService.groovy:36)
…
There are only two (!) unit tests for PBS cluster operations, but every method in every JobManager should be tested.
Is it necessary?
In PBS, if you pass an output path:
You need to pass -joe to combine stdout and stderr
To write nothing, you need to use the -k option
The variable "$PBS_JOBID" in the path will be replaced by the job ID
In LSF, if you pass an output path:
Stderr is written to the same file as stdout if you don't pass -eo/-e; it seems that also happens if you pass the same directory to both -oo and -eo.
%J in the path will be replaced by the job ID, and %I by the array index
BE should make sure that if you use the same options, the same files are written on all systems.
This exception occurred today in OTP:
2017-09-15 09:34:35,717 [taskScheduler-8] INFO scheduler.ClusterJobMonitoringService - 15227597 finished on Realm Realm 28110774 DKFZ_13.1 DATA_MANAGEMENT production
2017-09-15 09:34:35,972 [taskScheduler-8] WARN scheduler.ClusterJobMonitoringService - Failed to fill in runtime statistics for Cluster job 15227597 on Realm 28110774 DKFZ_13.1 DATA_MANAGEMENT production with user otp
java.time.format.DateTimeParseException: Text cannot be parsed to a Duration
at java.time.Duration.parse(Duration.java:412)
at de.dkfz.roddy.execution.jobs.cluster.pbs.PBSJobManager$_processQstatOutput_closure15.doCall(PBSJobManager.groovy:776)
at de.dkfz.roddy.execution.jobs.cluster.pbs.PBSJobManager.processQstatOutput(PBSJobManager.groovy:735)
at de.dkfz.roddy.execution.jobs.cluster.pbs.PBSJobManager.queryExtendedJobStateById(PBSJobManager.groovy:591)
at de.dkfz.tbi.otp.job.processing.ClusterJobSchedulerService.retrieveAndSaveJobStatistics(ClusterJobSchedulerService.groovy:153)
at de.dkfz.tbi.otp.job.scheduler.ClusterJobMonitoringService$_check_closure5$_closure9.doCall(ClusterJobMonitoringService.groovy:132)
at de.dkfz.tbi.otp.job.scheduler.ClusterJobMonitoringService$_check_closure5.doCall(ClusterJobMonitoringService.groovy:121)
at de.dkfz.tbi.otp.job.scheduler.ClusterJobMonitoringService.check(ClusterJobMonitoringService.groovy:116)
at de.dkfz.tbi.otp.job.scheduler.SchedulerService.clusterJobCheck(SchedulerService.groovy:301)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Job → BEJob
The toolScript argument of the BEJob constructor shouldn't be double escaped.
Our admin just explained to me that the default, standard behaviour for non-interactive sessions is that e.g. bsub is not in the PATH when it is called via ssh.
For our PBS cluster we had a SUSE system which was behaving in a non-standard way (as I was told). There, the .bashrc was also loaded in non-interactive sessions.
Our LSF cluster behaves in a standards-conformant way (as explained), and there we'd either need to add something to the .bashrc or .bash_profile, OR we need to call bsub/bjobs etc. with the full path.
For this case, we could introduce:
It has several similar (and apparently unused?) methods, maybe some of them could be removed.
For some methods it's not obvious how to implement them, so it would be nice if there was some documentation.
There's an infinite loop in BEJob.getLoggingDirectory; the method calls itself.
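A minimal illustration of this bug class and its fix (LoggingJob is a stand-in for BEJob):

```java
// Hedged sketch of the bug class described above: a getter that calls
// itself recurses until StackOverflowError; returning the backing
// field fixes it. LoggingJob stands in for BEJob.
class LoggingJob {
    private final String loggingDirectory;

    LoggingJob(String loggingDirectory) {
        this.loggingDirectory = loggingDirectory;
    }

    // Buggy version (do not use): infinite recursion.
    // String getLoggingDirectory() { return getLoggingDirectory(); }

    String getLoggingDirectory() {
        return loggingDirectory;   // return the field, not the getter
    }
}
```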
When using SNAPSHOT versions, the dependency is downloaded every time. This also happens for transitive dependencies, such as when compiling OTP.
The server sends the following response because the password contains an &.
2017-09-26 15:54:04,565 [pool-4-thread-1] DEBUG http.wire - http-outgoing-1 << "JAXBException occurred : The entity name must immediately follow the '&' in the entity reference.. The entity name must immediately follow the '&' in the entity reference.. "
These states are never set and maybe could be deleted:
UNKNOWN_SUBMITTED: "Recently submitted job, jobState is unknown" (submitted jobs are actually set to HOLD, QUEUED or FAILED)
UNKNOWN_READOUT, UNSTARTED, FAILED_POSSIBLE
For these states the documentation could be improved:
FAILED: "BEJob has failed": this is only used when an error occurred while submitting the job
OK: "BEJob was ok": this seems to be used by array head jobs only? It is not exactly clear what "ok" means (maybe use COMPLETED_SUCCESSFULLY if the cluster scheduler can return whether a job was successful?)
ABORTED
It just returns BEJob.id.
This causes problems in OTP because it expects the short ID (number only).
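A hedged sketch of extracting the numeric short ID from a raw scheduler ID; the example raw formats are assumptions based on typical PBS/LSF output:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hedged sketch: extract the leading numeric part of a scheduler job
// ID (e.g. "5935.pbs-server" -> "5935"). The raw formats shown are
// assumptions about typical PBS/LSF output, not BE's actual inputs.
class JobIdParser {
    private static final Pattern LEADING_DIGITS = Pattern.compile("^(\\d+)");

    static String shortId(String rawId) {
        Matcher m = LEADING_DIGITS.matcher(rawId.trim());
        if (!m.find())
            throw new IllegalArgumentException("No numeric ID in: '" + rawId + "'");
        return m.group(1);
    }
}
```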
Output when executing:
$ bjobs -noheader -o \"jobid job_name stat user queue job_description proj_name job_group job_priority pids exit_code from_host exec_host submit_time start_time finish_time cpu_used run_time user_group swap max_mem runtimelimit sub_cwd pend_reason exec_cwd output_file input_file effective_resreq exec_home slots delimiter=\'\<\'\" -u otptest
job_name: Illegal job ID.
The ' shouldn't be escaped, and the " only once.
On some systems, the cluster tools such as qsub, bsub, sbatch etc, might be in a directory which is not in the PATH
variable. It might be useful to have an optional configuration parameter to pass this directory to BE.
In PbsJobManager.runJob
, a lock (cacheLock
) is acquired but is never released, which will cause the methods using it to wait forever if they're called from different threads.
bjobs is called without the option -a, which is required to also show finished jobs.
From the manpage:
-a
Displays information about jobs in all states, including finished jobs that finished recently, within an interval specified by CLEAN_PERIOD in lsb.params (the default period is 1 hour).
Use -a with -x option to display all jobs that have triggered a job exception (overrun, underrun, idle).
When using toolString:
PbsCommand.toString results in an NPE
BEJob.toString results in an NPE
PbsJobManager.createCommand results in an NPE
(toolScript is piped into qsub, instead of being used as an argument)
In the constructor of RestResult, successful is set to true if the statusCode is 200, but the getter in ExecutionResult doesn't use successful; instead, it uses the exitCode.
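One way to keep the two success indicators from disagreeing is to derive both from a single source of truth. A hedged sketch (RestResultSketch mirrors the issue's field names, not BE's actual classes):

```java
// Hedged sketch: derive both "successful" and the exit-code-based
// getter from the same HTTP status, so they cannot contradict each
// other. RestResultSketch is illustrative, not BE's actual RestResult.
class RestResultSketch {
    final int statusCode;

    RestResultSketch(int statusCode) { this.statusCode = statusCode; }

    boolean isSuccessful() { return statusCode == 200; }

    // Map HTTP status onto the exit-code convention: 0 means success.
    int getExitCode() { return isSuccessful() ? 0 : 1; }
}
```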
When trying to run gradle build, it fails with the following output:
FAILURE: Build failed with an exception.
* Where:
Build file '/data/code/BatchEuphoria/build.gradle' line: 21
* What went wrong:
A problem occurred evaluating root project 'BatchEuphoria'.
> Command failed with code 128: '/usr/bin/env git -C /data/code/BatchEuphoria describe --tags --dirty'
Error output: 'fatal: No names found, cannot describe anything.'
de.dkfz.roddy.execution.cluster.pbs.PBSCommandParserTest > testConstructAndParseSimpleJob FAILED
org.codehaus.groovy.runtime.powerassert.PowerAssertionError at PBSCommandParserTest.groovy:49
Don't know why. The test works fine on my local machine and does not use the file or cluster system.