About the frequency of polling job status about bpipe HOT 7 OPEN

ssadedin commented on August 28, 2024

About the frequency of polling job status

from bpipe.

Comments (7)

lonsbio commented on August 28, 2024

From [email protected] on 2012-06-13T14:47:30Z

The downside of checking too frequently is that it results in large log files.

lonsbio commented on August 28, 2024

From [email protected] on 2012-06-20T13:24:02Z

And relatively high CPU usage...

lonsbio commented on August 28, 2024

From [email protected] on 2012-06-21T17:40:22Z

I definitely agree this would be good to make customizable (and very easy to do!).

I wonder if you see this as a single pipeline-wide configuration setting or something that might change per-command?

Also, rather than a fixed interval, some kind of exponential backoff might be appropriate. The idea is that for a very short command, or a command that fails, it is good to avoid a large latency in getting the status, especially since on some systems the status of finished jobs may not persist very long. So I'm thinking of two values, a "minimum" and a "maximum" poll interval, with Bpipe doing an exponential backoff between the two.
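To make the backoff idea concrete, here is a minimal sketch in shell. It is not Bpipe's implementation: the function names next_interval and poll_with_backoff are hypothetical, and the status command is passed in by the caller and is only assumed to print a line starting with COMPLETE once the job has finished, as the status() functions discussed in this thread do.

```shell
#!/bin/bash
# Hypothetical helpers illustrating min/max exponential backoff.
# Not part of Bpipe; names and arguments are assumptions for this sketch.

# next_interval CURRENT MIN MAX
# Prints the next poll interval in seconds: CURRENT doubled,
# then clamped into the range [MIN, MAX].
next_interval () {
    local current=$1 min=$2 max=$3
    local next=$(( current * 2 ))
    if [ "$next" -lt "$min" ]; then next=$min; fi
    if [ "$next" -gt "$max" ]; then next=$max; fi
    echo "$next"
}

# poll_with_backoff JOB_ID MIN MAX STATUS_CMD [ARGS...]
# Runs STATUS_CMD with JOB_ID appended until its output starts with
# COMPLETE, sleeping MIN seconds at first and doubling up to MAX.
poll_with_backoff () {
    local job_id=$1 min=$2 max=$3
    shift 3
    local interval=$min state
    while true; do
        # Give up if the status command itself fails.
        state=$("$@" "$job_id") || return 1
        case $state in
            COMPLETE*) echo "$state"; return 0 ;;
        esac
        sleep "$interval"
        interval=$(next_interval "$interval" "$min" "$max")
    done
}
```

With a minimum of 1 second and a maximum of 60, the waits between polls would be 1, 2, 4, 8, 16, 32, 60, 60, ... seconds, so short or failing jobs are noticed quickly while long jobs are polled rarely.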

Let me know any thoughts, and thanks for the suggestion!

Status: Started

lonsbio commented on August 28, 2024

From [email protected] on 2012-06-21T22:13:44Z

As you know, to run Bpipe on a TORQUE server we must create a file "bpipe.config" containing one line: executor="torque". For my use case, adding one more line such as frequency=600 (meaning the job's status is checked every 10 minutes) would be good enough.

Yes, I know that in "bpipe-torque.sh" the function status() uses "qstat" to check the job status, and that some systems do not keep the status of completed jobs for very long. In fact, the server my lab uses is configured to remove completed jobs from the queue immediately, so qstat does not work at all in my case. I asked the server support staff for help, and their main concern is the server's workload, because each qstat check sets up a connection with the job scheduler.

To solve this problem, I modified the status() function to use "tracejob" instead of "qstat" to keep track of the job status. The following is my status() code:

# get the status of a job given its id
# ($program_name, $SUCCESS, $TRACE_FAILED and $STATUS_MISSING_JOBID are
# expected to be defined elsewhere in the script)
status () {
    # make sure we have a job id on the command line
    if [[ $# -ge 1 ]]
    then
        # look at the output of tracejob
        trace_output=$(tracejob -a -l -m "$1")
        trace_success=$?
        if [[ $trace_success == 0 ]]
        then
            # XXX what to do if the awk fails?
            job_state=$(echo "$trace_output" | grep 'COMPLETE')
            if [[ -z $job_state ]]
            then
                job_state=$(echo "$trace_output" | grep 'Run')
                if [[ -z $job_state ]]
                then
                    echo WAITING
                else
                    echo RUNNING
                fi
            else
                # the three-argument form of match() requires GNU awk (gawk)
                job_state=$(echo "$trace_output" | awk 'match($0, /Exit_status=([0-9]+)/, a) {print a[1]}')
                echo "COMPLETE $job_state"
            fi
            exit $SUCCESS
        else
            exit $TRACE_FAILED
        fi
    else
        echo "$program_name ERROR: status requires a job identifier"
        exit $STATUS_MISSING_JOBID
    fi
}

The way tracejob works is that, instead of initiating a connection with the job scheduler on the server and checking the queue status, it reads the server log files, which are kept for several days to weeks after the job is done. The downside is that only the head node of the server has access to those log files. However, this works for my specific situation.
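On the "what to do if the awk fails" question in the code above: the three-argument match() is a GNU awk extension, so the extraction may fail on systems with a different awk. Below is a rough sketch of a portable alternative using sed. The function name extract_exit_status is hypothetical, and the only thing assumed about the tracejob output is that it contains the text Exit_status=N, as the status() function already assumes.

```shell
#!/bin/bash
# Hypothetical helper: read text on stdin and print the number that
# follows "Exit_status=" on the first line containing one.
# Prints nothing if no such text is found, so the caller can test for
# an empty result instead of relying on gawk being installed.
extract_exit_status () {
    sed -n 's/.*Exit_status=\([0-9][0-9]*\).*/\1/p' | head -n 1
}
```

Inside status() it could be used as job_state=$(echo "$trace_output" | extract_exit_status), followed by a check such as [[ -z $job_state ]] to handle the case where no exit status was found.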

Any thoughts?

lonsbio commented on August 28, 2024

From [email protected] on 2012-06-22T19:57:54Z

Thanks for the followup thoughts - I will discuss this with the author of the Torque support in Bpipe (it wasn't me) and get him to follow up.

lonsbio commented on August 28, 2024

From [email protected] on 2012-06-25T01:23:55Z

I agree that polling qstat is not ideal, and certainly no good if your Torque installation removes jobs immediately.

Our sys admins were kind enough to extend the time jobs are retained for a while after they complete (personally I think this is a reasonable thing to do, but it depends on the system you are using).

As you say, tracejob can be used as a workaround, but also as you say, it has issues with privileges.

The Torque manual says:

"To function properly, it must be run on a node and as a user which can access these files. By default, these files are all accessible by the user root and only available on the cluster management node. "

For example, regular users on our system cannot use tracejob.

We have found that 10 seconds is reasonable for polling qstat, as long as the job records are kept for a short time after the jobs have completed.

Perhaps we should bite the bullet and look at using a library such as DRMAA ( http://www.drmaa.org/ ) to launch jobs? I've had it on my whiteboard for months now :)

lonsbio commented on August 28, 2024

From [email protected] on 2012-06-25T21:54:32Z

I've been advised by some knowledgeable folks of a couple of things that might help.

qsub supports an "x" flag which can be used in addition to the -I (capital i) for interactive jobs.

This allows you to execute a command and wait for it to complete:

$ cat hostname.sh
#!/bin/bash
hostname

$ qsub -Ix ./hostname.sh
qsub: waiting for job 1026513 to start
qsub: job 1026513 ready

bruce009

qsub: job 1026513 completed

Moab's showq supports a -c option which shows information about completed jobs for JOBCPURGETIME (default 5 minutes), including the exit code (and it has a --xml option). Of course this requires that the site is using Moab in addition to Torque.
