batcheuphoria's Issues

Share code between JobManagers

It would be quite sensible to share more code between JobManagers, especially those using command line tools, since they contain a lot of duplicated code.
This would be very useful for preventing the reintroduction of bugs that were already fixed, such as #52.

BatchEuphoria does not handle full queues properly.

If more jobs are submitted than are allowed for the user, Roddy dies with a stack trace. However, this is a BE problem, and BE should throw a proper exception, which could then be caught by clients.

A workflow error occurred, try to rollback / abort submitted jobs.
bkill 14843 14844 14848 14849 14850 14852 14856
An unknown / unhandled exception occurred: 'Could not parse raw ID from: 'Group : Pending job threshold reached. Retrying in 60 seconds...''
de.dkfz.roddy.execution.jobs.cluster.lsf.LSFJobManager.parseJobID(LSFJobManager.groovy:611)
de.dkfz.roddy.execution.jobs.BatchEuphoriaJobManager.extractAndSetJobResultFromExecutionResult(BatchEuphoriaJobManager.groovy:143)
de.dkfz.roddy.execution.jobs.cluster.lsf.LSFJobManager.runJob(LSFJobManager.groovy:148)
de.dkfz.roddy.execution.jobs.cluster.lsf.LSFJobManager$runJob.call(Unknown Source)
de.dkfz.roddy.execution.jobs.Job.run(Job.groovy:534)
de.dkfz.roddy.knowledge.methods.GenericMethod.createAndRunSingleJob(GenericMethod.groovy:507)
de.dkfz.roddy.knowledge.methods.GenericMethod._callGenericToolOrToolArray(GenericMethod.groovy:255)
de.dkfz.roddy.knowledge.methods.GenericMethod.callGenericTool(GenericMethod.groovy:50)
de.dkfz.b080.co.files.CoverageTextFile.plot(CoverageTextFile.java:49)
de.dkfz.b080.co.files.CoverageTextFileGroup.plot(CoverageTextFileGroup.java:47)
de.dkfz.b080.co.qcworkflow.QCPipeline.execute(QCPipeline.groovy:90)
de.dkfz.roddy.core.ExecutionContext.execute(ExecutionContext.groovy:625)
de.dkfz.roddy.core.Analysis.executeRun(Analysis.java:397)
de.dkfz.roddy.core.Analysis.executeRun(Analysis.java:341)
de.dkfz.roddy.core.Analysis.rerun(Analysis.java:229)
de.dkfz.roddy.client.cliclient.RoddyCLIClient.rerun(RoddyCLIClient.groovy:513)
de.dkfz.roddy.client.cliclient.RoddyCLIClient.parseStartupMode(RoddyCLIClient.groovy:116)
de.dkfz.roddy.Roddy.parseRoddyStartupModeAndRun(Roddy.java:721)
de.dkfz.roddy.Roddy.startup(Roddy.java:289)
de.dkfz.roddy.Roddy.main(Roddy.java:216)
The reason here is an interaction of Roddy and BE. It seems BE searches for a specific line in the output that is not found; instead it returns the wait notice above. When something like this happens during a manual submission on the command line, bsub blocks.
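A sketch of how BE could distinguish this case (the class and method names here are hypothetical, not BE's actual API; the "Job <id>" line is LSF's normal bsub response): before trying to parse a job ID, check the submission output for known scheduler notices and throw a dedicated exception that clients can catch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not BE's actual API: turn known scheduler notices
// into a typed exception instead of a generic "Could not parse raw ID" error.
public class SubmissionOutputCheck {

    public static class BESubmissionException extends RuntimeException {
        public BESubmissionException(String message) { super(message); }
    }

    // LSF normally prints e.g.: Job <14843> is submitted to queue <short>.
    private static final Pattern JOB_ID = Pattern.compile("Job <(\\d+)>");

    public static String parseJobId(String rawOutput) {
        if (rawOutput.contains("Pending job threshold reached")) {
            // Queue is full for this user: a typed exception clients can catch.
            throw new BESubmissionException(
                    "Pending job threshold reached, submission rejected: " + rawOutput);
        }
        Matcher m = JOB_ID.matcher(rawOutput);
        if (!m.find()) {
            throw new BESubmissionException("Could not parse job ID from: " + rawOutput);
        }
        return m.group(1);
    }
}
```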

Exceptions should contain stderr when a command failed

When executing a command (bsub, bjobs, etc.) fails and BE throws an exception, it would be nice if the exception message contained the stderr of that command.
To implement this, a method in ExecutionResult needs to be extended.
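A minimal sketch of the idea, using a simplified stand-in for BE's ExecutionResult (the real class differs; describeFailure is a hypothetical method name):

```java
import java.util.List;

// Simplified stand-in for BE's ExecutionResult: keep stderr and expose it
// so that exception messages can include it when a command fails.
public class ExecutionResult {
    public final int exitCode;
    public final List<String> stdout;
    public final List<String> stderr;   // proposed addition

    public ExecutionResult(int exitCode, List<String> stdout, List<String> stderr) {
        this.exitCode = exitCode;
        this.stdout = stdout;
        this.stderr = stderr;
    }

    /** Builds an exception message that includes the command's stderr. */
    public String describeFailure(String command) {
        return "Command failed with exit code " + exitCode + ": " + command
                + "\nstderr: " + String.join("\n", stderr);
    }
}
```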

Environment variables don't work for PBS

When I pass environment variables to the constructor of BEJob, I expect them to be passed to qsub with the -v parameter; instead, -v PARAMETER_FILE=null is used.
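For illustration, the environment map could be rendered into the expected qsub option like this (PbsEnvArgs is a hypothetical helper, not BE's code; qsub's -v option takes a comma-separated variable=value list):

```java
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper: render a BEJob environment map as a qsub -v option.
public class PbsEnvArgs {
    public static String toVOption(Map<String, String> env) {
        if (env.isEmpty()) return "";
        return "-v " + env.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining(","));
    }
}
```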

Builder for BEJob

A builder for BEJob would be nice; the constructor is a bit unclear, with many parameters, not all of which are needed.
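A builder could look roughly like this (BEJobSketch is an illustrative stand-in; the real BEJob has many more parameters, each of which would get its own builder method with a sensible default):

```java
import java.util.Map;

// Illustrative stand-in for BEJob with a builder: only the fields the
// caller cares about are set, everything else keeps a default.
public class BEJobSketch {
    public final String jobName;
    public final String tool;
    public final Map<String, String> environment;

    private BEJobSketch(Builder b) {
        this.jobName = b.jobName;
        this.tool = b.tool;
        this.environment = b.environment;
    }

    public static class Builder {
        private String jobName;
        private String tool;
        private Map<String, String> environment = Map.of();   // default: empty

        public Builder jobName(String v) { this.jobName = v; return this; }
        public Builder tool(String v) { this.tool = v; return this; }
        public Builder environment(Map<String, String> v) { this.environment = v; return this; }
        public BEJobSketch build() { return new BEJobSketch(this); }
    }
}
```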

Let the ExecutionService base class throw TimeoutException on failed execute

LSF, for example, will just block and retry the command submission; PBS does not (or at least our installation doesn't). Or it may depend on the configuration, which can lead to surprising behaviour. In any case, the developer should be aware of this, and a TimeoutException might do just the right thing. Currently BE only offers the base class or interface, but that could be used to enforce the exception.
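A sketch of how a base class could enforce this, by running the command on a single-threaded executor and bounding the wait (an assumption about structure, not how BE's ExecutionService is actually built):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Sketch: bound the execution time so a blocking bsub surfaces as a
// TimeoutException that every caller is forced to think about.
public class TimedExecution {
    public static <T> T callWithTimeout(Callable<T> command, long seconds) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<T> f = pool.submit(command);
            return f.get(seconds, TimeUnit.SECONDS);   // throws TimeoutException
        } finally {
            pool.shutdownNow();   // interrupt the command if it is still running
        }
    }
}
```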

LSF uses strange duration format, which causes exception in setJobInfoForJobDetails

This is especially problematic because LsfJobManager calls setJobInfoForJobDetails every time job states are updated, not only when statistics are requested.

Command executed by BE:

bjobs -noheader -a -o "jobid job_name stat user queue job_description proj_name job_group job_priority pids exit_code from_host exec_host submit_time start_time finish_time cpu_used run_time user_group swap max_mem runtimelimit sub_cwd pend_reason exec_cwd output_file input_file effective_resreq exec_home slots delimiter='<'" -u otptest 

Output of bjobs:

5935<otp_workflow_test_pid_14_DataInstallationWorkflow_1_CopyFilesJob<DONE<otptest<medium-dmg<-<default<-<-<-<-<tbi-cn010<tbi-cn010<Sep 29 15:06<Sep 29 15:06<Sep 29 15:06 L<0.1 second(s)<0 second(s)<-<-<-<20.0/tbi-cn010<$HOME<-</home/otptest</ibios/dmdc/otp/workflow-tests/tmp/DataInstallationWorkflow-klinga-2017-09-29-15-05-07-297+0200-9AFOIiRNYiMw66qP/logging_root_path/clusterLog/2017-09-29/otp_workflow_test_pid_14_DataInstallationWorkflow_1_CopyFilesJob.o5935<-<select[type == any ] order[r15s:pg] </home/otptest<1
5934<otp_workflow_test_pid_14_DataInstallationWorkflow_1_CopyFilesJob<DONE<otptest<medium-dmg<-<default<-<-<-<-<tbi-cn010<tbi-cn010<Sep 29 15:06<Sep 29 15:06<Sep 29 15:06 L<0.1 second(s)<0 second(s)<-<-<-<20.0/tbi-cn010<$HOME<-</home/otptest</ibios/dmdc/otp/workflow-tests/tmp/DataInstallationWorkflow-klinga-2017-09-29-15-05-07-297+0200-9AFOIiRNYiMw66qP/logging_root_path/clusterLog/2017-09-29/otp_workflow_test_pid_14_DataInstallationWorkflow_1_CopyFilesJob.o5934<-<select[type == any ] order[r15s:pg] </home/otptest<1

Exception:

Duration string is not of the format HH+:MM:SS: '0.1 second(s)'
de.dkfz.roddy.BEException: Duration string is not of the format HH+:MM:SS: '0.1 second(s)'
	at de.dkfz.roddy.execution.jobs.cluster.ClusterJobManager.parseColonSeparatedHHMMSSDuration(ClusterJobManager.groovy:104)
	at de.dkfz.roddy.execution.jobs.cluster.lsf.LSFJobManager.setJobInfoForJobDetails(LSFJobManager.groovy:475)
	at de.dkfz.roddy.execution.jobs.cluster.lsf.LSFJobManager.updateJobStatus(LSFJobManager.groovy:373)
	at de.dkfz.roddy.execution.jobs.cluster.lsf.LSFJobManager.queryJobStatusAll(LSFJobManager.groovy:82)
	at de.dkfz.roddy.execution.jobs.cluster.lsf.LSFJobManager$queryJobStatusAll$2.call(Unknown Source)
	at de.dkfz.tbi.otp.job.processing.ClusterJobSchedulerService.retrieveKnownJobsWithState(ClusterJobSchedulerService.groovy:163)
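A more lenient parser could accept both formats (a sketch, not BE's actual fix; the "N second(s)" pattern is taken from the bjobs output shown above):

```java
import java.time.Duration;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: accept both the HH:MM:SS form and the "0.1 second(s)" form
// that bjobs uses for very short run_time/cpu_used values.
public class LsfDurationParser {
    private static final Pattern HHMMSS = Pattern.compile("(\\d+):(\\d{2}):(\\d{2})");
    private static final Pattern SECONDS = Pattern.compile("([\\d.]+) second\\(s\\)");

    public static Duration parse(String value) {
        Matcher m = HHMMSS.matcher(value.trim());
        if (m.matches()) {
            return Duration.ofHours(Long.parseLong(m.group(1)))
                    .plusMinutes(Long.parseLong(m.group(2)))
                    .plusSeconds(Long.parseLong(m.group(3)));
        }
        m = SECONDS.matcher(value.trim());
        if (m.matches()) {
            return Duration.ofMillis(Math.round(Double.parseDouble(m.group(1)) * 1000));
        }
        throw new IllegalArgumentException("Unparseable duration: " + value);
    }
}
```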

Build fails (unknown git option)

Build fails on CentOS 7 (whose system Git predates the -C option, which was added in Git 1.8.5) with this message:

$ gradle jar
Starting a Gradle Daemon (subsequent builds will be faster)

FAILURE: Build failed with an exception.

* Where:
Build file '/data/code/BatchEuphoria/build.gradle' line: 22

* What went wrong:
A problem occurred evaluating root project 'BatchEuphoria'.
> Command failed with code 129: '/usr/bin/env git -C /data/code/BatchEuphoria describe --tags --dirty'
  Error output: 'Unknown option: -Cusage: git [--version] [--help] [-c name=value]           [--exec-path[=<path>]] [--html-path] [--man-path] [--info-path]           [-p|--paginate|--no-pager] [--no-replace-objects] [--bare]           [--git-dir=<path>] [--work-tree=<path>] [--namespace=<name>]           <command> [<args>]'

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output.

BUILD FAILED

Make mail sending configurable

Available options in PBS and LSF:

PBS

-m | mail_options | Defines the set of conditions under which the execution server will send a mail message about the job. The mail_options argument is a string which consists of either the single character "n", or one or more of the characters "a", "b", and "e".
    If the character "n" is specified, no normal mail is sent. Mail for job cancels and other events outside of normal job processing is still sent.
    For the letters "a", "b", and "e":
        a – Mail is sent when the job is aborted by the batch system.
        b – Mail is sent when the job begins execution.
        e – Mail is sent when the job terminates.
    If the -m option is not specified, mail will be sent if the job is aborted.

-M | user_list | Declares the list of users to whom mail is sent by the execution server when it sends mail about the job.
    The user_list argument is of the form:
        user[@host][,user[@host],...]
    If unset, the list defaults to the submitting user at the qsub host, i.e. the job owner.

LSF

-u mail_user
    Sends mail to the specified email destination. To specify a Windows user account, include the domain name in uppercase letters and use a single backslash (DOMAIN_NAME\user_name) in a Windows command line or a double backslash (DOMAIN_NAME\\user_name) in a UNIX command line.

-N
    Sends the job report to you by mail when the job finishes. When used without any other options, behaves the same as the default.
    Use only with -o, -oo, -I, -Ip, and -Is options, which do not send mail, to force LSF to send you a mail message when the job is done.

-B
    Sends mail to you when the job is dispatched and begins execution.

Set -o or -oo to disable sending mail; if no log file is required, use -o /dev/null.
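One way to make this configurable is a scheduler-neutral mail configuration that each JobManager translates into its own flags; a sketch for the PBS side (MailOptions is a hypothetical name; LSF would map the begin/end events to -B/-N and the address to -u):

```java
import java.util.Set;

// Hypothetical scheduler-neutral mail configuration, translated here to
// the PBS -m/-M options described above.
public class MailOptions {
    public enum Event { ABORTED, BEGUN, ENDED }

    public static String toPbsArgs(Set<Event> events, String userList) {
        StringBuilder sb = new StringBuilder("-m ");
        if (events.isEmpty()) {
            sb.append('n');                                   // "n": no normal mail
        } else {
            if (events.contains(Event.ABORTED)) sb.append('a');
            if (events.contains(Event.BEGUN)) sb.append('b');
            if (events.contains(Event.ENDED)) sb.append('e');
        }
        if (userList != null) sb.append(" -M ").append(userList);
        return sb.toString();
    }
}
```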

Introduce getter for "JOB_ID" environment variable and use consistent format

Introduce getters for "*_JOBID" and similar scheduler-specific environment variables in the ClusterJobManager interface and its implementations.
For OTP we only need JOBID at the moment, but other variables might also be useful for others. A list can be found here under Environment.
PBS_SCRATCH_DIR (used in PBSJobManager) is not a PBS-specific variable, but specific to our cluster, and maybe shouldn't be in BE.
The strings should all be in the same format, preferably without ${}, so they can also be used in places other than Bash scripts.
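A sketch of such getters (PBS_JOBID and LSB_JOBID are the real PBS and LSF variable names; the interface and class names are illustrative):

```java
// Illustrative interface: each JobManager reports the name of its job-ID
// environment variable, without ${} so callers decide how to expand it.
public interface JobIdVariable {
    String jobIdVariableName();
}

class PbsVariables implements JobIdVariable {
    public String jobIdVariableName() { return "PBS_JOBID"; }
}

class LsfVariables implements JobIdVariable {
    public String jobIdVariableName() { return "LSB_JOBID"; }
}
```

A Bash caller would then wrap it as needed, e.g. "$" + jobIdVariableName().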

Method to query job state for all jobs

It would be great if BE had a method to query the job states of all jobs of a user.
I noticed that it already updates all job states internally; for OTP it would be useful to expose that information.

Refactor/reconsider convertResources method

Comment from @dankwart-de:

Please also unify the convertResources method across the classes. There are code duplicates and missing functionality, e.g. in the GE JobManager. However, we should also make sure that we convert the resource set entries to the right target values.

Constructor of RestExecutionService fails

with this exception:

java.lang.NullPointerException
	at java.time.Duration.between(Duration.java:473)
	at de.dkfz.roddy.execution.RestExecutionService.execute(RestExecutionService.groovy:171)
	at de.dkfz.roddy.execution.RestExecutionService.logon(RestExecutionService.groovy:111)
	at de.dkfz.roddy.execution.RestExecutionService.<init>(RestExecutionService.groovy:92)
	at de.dkfz.tbi.otp.job.processing.ClusterJobManagerFactoryService.getJobManager(ClusterJobManagerFactoryService.groovy:36)
…

Test coverage is not so good

There are only two (!) unit tests for PBS cluster operations, but every method in every JobManager should be tested.

Handle location of log files for different systems

In PBS, if you pass

  • nothing: a file [jobname].[o|e][jobid] is written in your home directory
  • -o/-e with an existing directory: a file [jobname].[o|e][jobid] is written in that directory
  • -o/-e with a filename in an existing directory: that filename is used; if the parent directory doesn't exist, nothing is written

You need to pass -j oe to combine stdout and stderr.
To write nothing, you need to use the -k option.
The variable "$PBS_JOBID" in the path will be replaced by the job ID.

In LSF, if you pass

  • nothing: nothing is written
  • -oo/-eo with an existing directory: a file [jobid].out is written in that directory
  • -oo/-eo with a filename in an existing directory: that filename is used; if the parent directory doesn't exist, nothing is written

Stderr is written to the same file as stdout if you don't pass -eo/-e; that also seems to happen if you pass the same directory to both -oo and -eo.
%J in the path will be replaced by the job ID, and %I by the array index.

BE should make sure that, given the same options, the same files are written on all systems.
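One way to get identical behaviour on both schedulers is to always pass an explicit file name rather than relying on the per-scheduler defaults; a sketch (LogFileArgs is a hypothetical helper; flag behaviour and placeholders as described above, with PBS expanding $PBS_JOBID and LSF expanding %J):

```java
import java.nio.file.Path;

// Hypothetical helper: build equivalent log-file arguments for PBS and
// LSF so both write [jobname].o[jobid] into the given directory, with
// stdout and stderr merged.
public class LogFileArgs {
    public static String forPbs(Path dir, String jobName) {
        return "-j oe -o " + dir.resolve(jobName + ".o$PBS_JOBID");
    }
    public static String forLsf(Path dir, String jobName) {
        return "-oo " + dir.resolve(jobName + ".o%J");
    }
}
```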

PbsJobManager failed to read runtime statistics

This exception occurred today in OTP:

2017-09-15 09:34:35,717 [taskScheduler-8] INFO  scheduler.ClusterJobMonitoringService  - 15227597 finished on Realm Realm 28110774 DKFZ_13.1 DATA_MANAGEMENT production
2017-09-15 09:34:35,972 [taskScheduler-8] WARN  scheduler.ClusterJobMonitoringService  - Failed to fill in runtime statistics for Cluster job 15227597 on Realm 28110774 DKFZ_13.1 DATA_MANAGEMENT production with user otp
java.time.format.DateTimeParseException: Text cannot be parsed to a Duration
        at java.time.Duration.parse(Duration.java:412)
        at de.dkfz.roddy.execution.jobs.cluster.pbs.PBSJobManager$_processQstatOutput_closure15.doCall(PBSJobManager.groovy:776)
        at de.dkfz.roddy.execution.jobs.cluster.pbs.PBSJobManager.processQstatOutput(PBSJobManager.groovy:735)
        at de.dkfz.roddy.execution.jobs.cluster.pbs.PBSJobManager.queryExtendedJobStateById(PBSJobManager.groovy:591)
        at de.dkfz.tbi.otp.job.processing.ClusterJobSchedulerService.retrieveAndSaveJobStatistics(ClusterJobSchedulerService.groovy:153)
        at de.dkfz.tbi.otp.job.scheduler.ClusterJobMonitoringService$_check_closure5$_closure9.doCall(ClusterJobMonitoringService.groovy:132)
        at de.dkfz.tbi.otp.job.scheduler.ClusterJobMonitoringService$_check_closure5.doCall(ClusterJobMonitoringService.groovy:121)
        at de.dkfz.tbi.otp.job.scheduler.ClusterJobMonitoringService.check(ClusterJobMonitoringService.groovy:116)
        at de.dkfz.tbi.otp.job.scheduler.SchedulerService.clusterJobCheck(SchedulerService.groovy:301)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)
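java.time.Duration.parse only accepts ISO-8601 strings such as "PT1H2M3S", while qstat reports walltime in HH:MM:SS form, which would produce exactly this DateTimeParseException. A conversion sketch (QstatWalltime is a hypothetical helper, not BE's actual fix):

```java
import java.time.Duration;

// Sketch: convert qstat's HH:MM:SS walltime to ISO-8601 before handing
// it to Duration.parse, which rejects colon-separated input.
public class QstatWalltime {
    public static Duration parse(String hhmmss) {
        String[] parts = hhmmss.trim().split(":");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected HH:MM:SS, got: " + hhmmss);
        }
        return Duration.parse("PT" + parts[0] + "H" + parts[1] + "M" + parts[2] + "S");
    }
}
```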

Code example in README doesn't work

  • Job → BEJob
  • the toolScript argument of the BEJob constructor shouldn't be double-escaped
  • there's no explanation that jobs are held by default and need to be released

We need to discuss whether we want to allow configuring the paths of e.g. the bsub binary

Our admin just explained to me that the default and standard behaviour for non-interactive sessions is that e.g. bsub is not in the PATH when it is called via SSH.

For our PBS cluster we had a SUSE system that behaved in a non-standard way (as I was told): there, the .bashrc was also loaded in non-interactive sessions.

Our LSF cluster conforms to the standard (as explained), so there we'd either need to add something to the .bashrc or .bash_profile, OR we need to call bsub/bjobs etc. with the full path.

For this case, we could introduce:

  • A check, if the submission binaries are accessible
  • Informational messages, if they could not be found and what to do
  • A way to configure the paths of used submission binaries

Don't depend on SNAPSHOT versions

When using SNAPSHOT versions, the dependency is downloaded every time. This also happens for transitive dependencies, e.g. when compiling OTP.

RestExecutionService.logon doesn't encode password

The server sends the following response because the password contains an &.

2017-09-26 15:54:04,565 [pool-4-thread-1] DEBUG http.wire  - http-outgoing-1 << "JAXBException occurred : The entity name must immediately follow the '&' in the entity reference.. The entity name must immediately follow the '&' in the entity reference.. "
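Since the server complains about an entity reference, the password evidently ends up unescaped inside an XML request body; the XML special characters need escaping before serialization. A sketch (XmlEscape is an illustrative helper; a real fix would more likely let the XML binding layer set the field so it escapes automatically):

```java
// Illustrative helper: escape the five XML special characters so a
// literal '&' in a password is not read as the start of an entity
// reference. Order matters: '&' must be replaced first.
public class XmlEscape {
    public static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }
}
```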

JobStates and their documentation could be improved

These states are never set and could perhaps be deleted:

  • UNKNOWN_SUBMITTED: recently submitted job whose state is unknown (submitted jobs are actually set to HOLD, QUEUED, or FAILED)
  • UNKNOWN_READOUT, UNSTARTED, FAILED_POSSIBLE
    (if they are required for Roddy, document that they are not set by BE and that users don't have to handle them)

For these states the documentation could be improved:

  • FAILED: BEJob has failed; this is only used when an error occurred while submitting the job
  • OK: BEJob was OK; this seems to be used by array head jobs only? It is not exactly clear what OK means (maybe use COMPLETED_SUCCESSFULLY if the cluster scheduler can report whether a job was successful?)
  • ABORTED

LsfJobManager: bjobs command is escaped wrong

Output when executing:

$ bjobs -noheader -o \"jobid job_name stat user queue job_description proj_name job_group job_priority pids exit_code from_host exec_host submit_time start_time finish_time cpu_used run_time user_group swap max_mem runtimelimit sub_cwd pend_reason exec_cwd output_file input_file effective_resreq exec_home slots delimiter=\'\<\'\" -u otptest 

job_name: Illegal job ID.

The ' characters shouldn't be escaped at all, and the " characters only once.
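When the command is executed locally, passing the arguments as a list avoids shell quoting entirely, since each element reaches bjobs as one argv entry (a sketch with a shortened format string; over SSH, a single correctly quoted command string is still required):

```java
import java.util.List;

// Sketch: build the bjobs command as an argument list. The format string
// is passed as one argv entry, so no \" or \' escaping is needed at all.
public class BjobsCommand {
    public static List<String> build(String user) {
        String format = "jobid job_name stat user queue delimiter='<'";   // shortened
        return List.of("bjobs", "-noheader", "-a", "-o", format, "-u", user);
    }
}
```

The list form can be handed directly to ProcessBuilder for local execution.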

Add configuration for path of cluster tools

On some systems, the cluster tools such as qsub, bsub, sbatch, etc. might be in a directory that is not in the PATH variable. It might be useful to have an optional configuration parameter to pass this directory to BE.

Lock not released in PbsJobManager.runJob

In PbsJobManager.runJob, a lock (cacheLock) is acquired but never released, which causes methods using it to wait forever when they are called from different threads.
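The standard pattern is to release the lock in a finally block (a minimal sketch, not PbsJobManager's actual code):

```java
import java.util.concurrent.locks.ReentrantLock;

// Minimal sketch of the fix: unlock in finally so an exception in the
// critical section cannot leave the lock held forever.
public class CacheAccess {
    private final ReentrantLock cacheLock = new ReentrantLock();

    public void runJob(Runnable criticalSection) {
        cacheLock.lock();
        try {
            criticalSection.run();
        } finally {
            cacheLock.unlock();   // always runs, even when criticalSection throws
        }
    }

    public boolean isLocked() { return cacheLock.isLocked(); }
}
```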

LsfJobManager: Job statistics not available

bjobs is called without the option -a, which is required to also show finished jobs.

From the manpage:

-a

    Displays information about jobs in all states, including finished jobs that finished recently, within an interval specified by CLEAN_PERIOD in lsb.params (the default period is 1 hour).

    Use -a with -x option to display all jobs that have triggered a job exception (overrun, underrun, idle).

Several problems when using toolScript

When using toolScript,

  • calling PbsCommand.toString results in an NPE
  • calling BEJob.toString results in an NPE
  • calling PbsJobManager.createCommand results in an NPE
  • the PBS submission command is not created correctly (it should pipe the toolScript into qsub instead of using it as an argument)
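A sketch of piping the script text into the submitter's stdin (QsubPipe is a hypothetical helper; in production the submitter would be qsub with its options, and qsub reads the job script from stdin when no script file argument is given):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: pipe the toolScript text into the submitter's
// stdin instead of passing it as an argument (qsub treats a plain
// argument as a script *file* name).
public class QsubPipe {
    public static Process submit(String submitter, String toolScript) throws IOException {
        Process p = new ProcessBuilder(submitter).start();
        try (OutputStream stdin = p.getOutputStream()) {
            stdin.write(toolScript.getBytes(StandardCharsets.UTF_8));
        }   // closing stdin signals end of script
        return p;
    }
}
```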

Gradle build fails

When trying to run gradle build, it fails with the following output:

FAILURE: Build failed with an exception.

* Where:
Build file '/data/code/BatchEuphoria/build.gradle' line: 21

* What went wrong:
A problem occurred evaluating root project 'BatchEuphoria'.
> Command failed with code 128: '/usr/bin/env git -C /data/code/BatchEuphoria describe --tags --dirty'
  Error output: 'fatal: No names found, cannot describe anything.'

Build fails

de.dkfz.roddy.execution.cluster.pbs.PBSCommandParserTest > testConstructAndParseSimpleJob FAILED
    org.codehaus.groovy.runtime.powerassert.PowerAssertionError at PBSCommandParserTest.groovy:49

I don't know why. The test works fine on my local machine and does not use the file system or the cluster.
