Comments (7)

lonsbio commented on August 28, 2024

From [email protected] on 2012-06-13T14:47:30Z

Checking too frequently has a downside: it produces large log files.

from bpipe.

lonsbio commented on August 28, 2024

From [email protected] on 2012-06-20T13:24:02Z

And relatively high CPU usage...

lonsbio commented on August 28, 2024

From [email protected] on 2012-06-21T17:40:22Z

I definitely agree this would be good to make customizable (and very easy to do!).

I wonder if you see this as a single pipeline-wide configuration setting or something that might change per-command?

Also, rather than a fixed interval, perhaps some kind of exponential backoff might be appropriate? The idea is that for a very short command, or a command that fails, it is good to avoid a large latency in getting the status, especially since on some systems the status of finished jobs may not persist for very long. So I'm thinking of two values, a "minimum" poll interval and a "maximum", with Bpipe doing an exponential backoff between the two.
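To make the backoff idea concrete, here is a minimal sketch in bash. It is only an illustration: the names MIN_POLL, MAX_POLL, next_interval and poll_schedule are hypothetical and are not existing Bpipe settings or functions, and doubling the interval is just one possible backoff rule.

```shell
#!/bin/bash
# Sketch only: MIN_POLL, MAX_POLL, next_interval and poll_schedule are
# hypothetical names, not existing Bpipe settings or functions.

MIN_POLL=1     # seconds to wait before the first status check
MAX_POLL=600   # upper bound on the wait between two checks

# Print the next poll interval: double the current one, capped at MAX_POLL.
next_interval () {
    local current=$1
    local next=$(( current * 2 ))
    if (( next > MAX_POLL ))
    then
        next=$MAX_POLL
    fi
    echo "$next"
}

# Print the waits a poller would use for the first N status checks,
# one per line, starting at MIN_POLL.
poll_schedule () {
    local checks=$1
    local interval=$MIN_POLL
    local i
    for (( i = 0; i < checks; i++ ))
    do
        echo "$interval"
        interval=$(next_interval "$interval")
    done
}
```

With these values, `poll_schedule 12` prints 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 600, 600. A real poller would sleep for each interval between status checks, so short or failing commands are noticed within seconds while long jobs settle at one check per MAX_POLL.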

Let me know any thoughts, and thanks for the suggestion!

Status: Started

lonsbio commented on August 28, 2024

From [email protected] on 2012-06-21T22:13:44Z

As you know, to run bpipe on a TORQUE server we must create a file "bpipe.config" with one line: executor="torque". For my use case, I think adding one more line such as frequency=600 (meaning the job's status is checked every 10 minutes) would be good enough.

Yes, I know that in "bpipe-torque.sh" the status() function uses "qstat" to check the job status, and that some systems won't keep the status of completed jobs for very long. In fact, the server my lab uses is configured to remove completed jobs from the queue immediately, so qstat doesn't work at all in my case. I asked the server support staff for help, and their main concern is the server's workload, since each qstat check sets up a connection with the job scheduler.

To solve this problem, I modified the status() function to use "tracejob" instead of "qstat" to keep track of the job status. Here is my status() code:


# get the status of a job given its id
status () {
    # make sure we have a job id on the command line
    if [[ $# -ge 1 ]]
    then
        # look at the output of tracejob
        trace_output=$(tracejob -a -l -m "$1")
        trace_success=$?
        if [[ $trace_success == 0 ]]
        then
            # XXX what to do if the awk fails?
            job_state=$(echo "$trace_output" | grep 'COMPLETE')
            if [[ -z $job_state ]]
            then
                job_state=$(echo "$trace_output" | grep 'Run')
                if [[ -z $job_state ]]
                then
                    echo WAITING
                else
                    echo RUNNING
                fi
            else
                # extract the numeric exit status (three-argument match() requires gawk)
                job_state=$(echo "$trace_output" | awk 'match($0, /Exit_status=([0-9]+)/, a) {print a[1]}')
                echo "COMPLETE $job_state"
            fi
            exit $SUCCESS
        else
            exit $TRACE_FAILED
        fi
    else
        echo "$program_name ERROR: status requires a job identifier"
        exit $STATUS_MISSING_JOBID
    fi
}


The way tracejob works is that, instead of initiating a connection with the job scheduler on the server and checking the queue status, it reads the server log files, which are kept for several days to weeks even after the job is done. The downside is that only the head node of the server has access to those log files. However, this works for my specific situation.
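On the "what to do if the awk fails?" question in the code above, one option is to separate the parsing from the tracejob call so it can be tried on saved output, and to fall back to a marker when no exit status is found. The sketch below assumes only what the status() function already relies on: that the output contains 'COMPLETE', 'Run' and 'Exit_status=N'. The helper name parse_trace_state and the UNKNOWN marker are hypothetical, and sed replaces the gawk-specific match() call.

```shell
#!/bin/bash
# Sketch only: parse_trace_state and the UNKNOWN marker are hypothetical;
# the input format is assumed from the status() function above.

# Read tracejob output on stdin and print WAITING, RUNNING or
# "COMPLETE <exit status>".
parse_trace_state () {
    local trace_output exit_status
    trace_output=$(cat)
    if echo "$trace_output" | grep -q 'COMPLETE'
    then
        # sed -n with the p flag prints only lines where the substitution
        # matched; head keeps the first such line.
        exit_status=$(echo "$trace_output" \
            | sed -n 's/.*Exit_status=\([0-9][0-9]*\).*/\1/p' | head -n 1)
        # Fall back to a marker when no exit status could be extracted.
        echo "COMPLETE ${exit_status:-UNKNOWN}"
    elif echo "$trace_output" | grep -q 'Run'
    then
        echo RUNNING
    else
        echo WAITING
    fi
}
```

Inside status() this could be called as `echo "$trace_output" | parse_trace_state`. Whether Bpipe would accept a non-numeric value after COMPLETE is not something this thread establishes, so the fallback marker would need to be adapted to whatever the caller expects.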

Any thoughts?

lonsbio commented on August 28, 2024

From [email protected] on 2012-06-22T19:57:54Z

Thanks for the followup thoughts - I will discuss this with the author of the Torque support in Bpipe (it wasn't me) and get him to follow up.

lonsbio commented on August 28, 2024

From [email protected] on 2012-06-25T01:23:55Z

I agree that polling qstat is not ideal, and certainly no good if your Torque installation removes jobs immediately.

Our sys admins were kind enough to extend the time jobs are retained for a while after they complete (personally I think this is a reasonable thing to do, but it depends on the system you are using).

As you say, tracejob can be used as a workaround, but also as you say, it has issues with privileges.

The Torque manual says:

"To function properly, it must be run on a node and as a user which can access these files. By default, these files are all accessible by the user root and only available on the cluster management node."

For example, regular users on our system cannot use tracejob.

We have found that 10 seconds is reasonable for polling qstat, as long as the job records are kept for a short time after the jobs have completed.

Perhaps we should bite the bullet and look at using a library such as DRMAA ( http://www.drmaa.org/ ) to launch jobs? I've had it on my whiteboard for months now :)

lonsbio commented on August 28, 2024

From [email protected] on 2012-06-25T21:54:32Z

I've been advised by some knowledgeable folks of a couple of things that might help.

  1. qsub supports an "x" flag which can be used in addition to the -I (capital i) for interactive jobs.

This allows you to execute a command and wait for it to complete:

$ cat hostname.sh
#!/bin/bash
hostname

$ qsub -Ix ./hostname.sh
qsub: waiting for job 1026513 to start
qsub: job 1026513 ready

bruce009

qsub: job 1026513 completed

  2. Moab's showq supports a -c option which shows information about completed jobs for JOBCPURGETIME (default 5 minutes), including the exit code (and it has a --xml option). Of course this requires that the site is using Moab in addition to Torque.
