benedictpaten / jobTree
Python based pipeline management software for clusters (but check out toil, its successor: https://github.com/BD2KGenomics/toil)
License: MIT License
Commit 76a4328 broke argparse compatibility by failing to maintain the two methods of displaying default values in help strings. The old, optparse method is to use
help="blah blah blah default=%default"
and the argparse method is to use
help="blah blah blah default=%(default)s"
The two interpolation styles are incompatible with one another, which is why we had two methods to handle this.
optparse is deprecated; argparse is its replacement. This can be patched by checking the class of the input parser in the addOptions() function contained in jobTree.src.jobTreeRun, BUT it will also need to be changed in sonLib.bioio addLoggingOptions(). I think the way jobTree and sonLib have used optparse is generic enough that the transition would be transparent to script writers.
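A minimal sketch of the class check described above, dispatching on the parser type so each help string gets the interpolation style its library expects (the helper name and option are illustrative, not jobTree's actual code):

```python
# Sketch: dispatch on the parser class so optparse and argparse each
# get the default-value placeholder they understand.
from argparse import ArgumentParser
from optparse import OptionParser

def addLoggingOptions(parser):
    if isinstance(parser, ArgumentParser):
        # argparse interpolates %(default)s in help strings
        parser.add_argument("--logLevel", default="INFO",
                            help="Set the log level, default=%(default)s")
    elif isinstance(parser, OptionParser):
        # optparse interpolates %default in help strings
        parser.add_option("--logLevel", default="INFO",
                          help="Set the log level, default=%default")
    else:
        raise TypeError("Unsupported parser type: %r" % type(parser))
```

Because the check happens in one place, callers can keep passing whichever parser they already use.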
Hi Benedict,
I got some error messages when I ran "make test":
Starting to create the job tree setup for the first time
Traceback (most recent call last):
File "/projects/gec/tool/assemblathon1/jobTree/bin/scriptTreeTest_Sort.py", line 119, in <module>
main()
File "/gpfs2/projects/gec/tool/assemblathon1/jobTree/test/sort/scriptTreeTest_Sort.py", line 112, in main
i = Stack(Setup(options.fileToSort, int(options.N))).startJobTree(options)
File "/gpfs2/projects/gec/tool/assemblathon1/jobTree/scriptTree/stack.py", line 112, in startJobTree
config, batchSystem = createJobTree(options)
File "/gpfs2/projects/gec/tool/assemblathon1/jobTree/src/jobTreeRun.py", line 221, in createJobTree
batchSystem = loadTheBatchSystem(config)
File "/gpfs2/projects/gec/tool/assemblathon1/jobTree/src/jobTreeRun.py", line 157, in loadTheBatchSystem
batchSystem = GridengineBatchSystem(config)
File "/gpfs2/projects/gec/tool/assemblathon1/jobTree/batchSystems/gridengine.py", line 136, in __init__
self.obtainSystemConstants()
File "/gpfs2/projects/gec/tool/assemblathon1/jobTree/batchSystems/gridengine.py", line 267, in obtainSystemConstants
p = subprocess.Popen(["qhost"], stdout = subprocess.PIPE,stderr = subprocess.STDOUT)
File "/usr/lib64/python2.6/subprocess.py", line 595, in __init__
errread, errwrite)
File "/usr/lib64/python2.6/subprocess.py", line 1106, in _execute_child
raise child_exception
OSError: [Errno 2] No such file or directory
E
======================================================================
ERROR: Uses the jobTreeTest code to test the scriptTree Target wrapper.
----------------------------------------------------------------------
Traceback (most recent call last):
File "/gpfs2/projects/gec/tool/assemblathon1/jobTree/test/scriptTree/scriptTreeTest.py", line 31, in testScriptTree_Example
system(command)
File "/gpfs2/projects/gec/tool/assemblathon1/sonLib/bioio.py", line 160, in system
raise RuntimeError("Command: %s exited with non-zero status %i" % (command, i))
RuntimeError: Command: scriptTreeTest_Wrapper.py --jobTree /gpfs2/projects/gec/tool/assemblathon1/jobTree/tmp_KEwg5sdOhb/jobTree --logLevel=INFO --retryCount=10 exited with non-zero status 256
======================================================================
ERROR: Tests that the global and local temp dirs of a job behave as expected.
----------------------------------------------------------------------
Traceback (most recent call last):
File "/gpfs2/projects/gec/tool/assemblathon1/jobTree/test/scriptTree/scriptTreeTest.py", line 39, in testScriptTree_Example2
system(command)
File "/gpfs2/projects/gec/tool/assemblathon1/sonLib/bioio.py", line 160, in system
raise RuntimeError("Command: %s exited with non-zero status %i" % (command, i))
RuntimeError: Command: scriptTreeTest_Wrapper2.py --jobTree /gpfs2/projects/gec/tool/assemblathon1/jobTree/tmp_GKficygzKn/jobTree --logLevel=INFO --retryCount=0 exited with non-zero status 256
======================================================================
ERROR: Tests the jobTreeStats utility using the scriptTree_sort example.
----------------------------------------------------------------------
Traceback (most recent call last):
File "/gpfs2/projects/gec/tool/assemblathon1/jobTree/test/utilities/statsTest.py", line 35, in testJobTreeStats_SortSimple
system(command)
File "/gpfs2/projects/gec/tool/assemblathon1/sonLib/bioio.py", line 160, in system
raise RuntimeError("Command: %s exited with non-zero status %i" % (command, i))
RuntimeError: Command: scriptTreeTest_Sort.py --jobTree /gpfs2/projects/gec/tool/assemblathon1/jobTree/tmp_rZqK92206Q/jobTree --logLevel=DEBUG --fileToSort=/gpfs2/projects/gec/tool/assemblathon1/jobTree/tmp_rZqK92206Q/tmp_wj73HRVMZx --N 1000 --stats --jobTime 0.5 exited with non-zero status 256
----------------------------------------------------------------------
Ran 14 tests in 318.742s
FAILED (errors=3)
Could I ask what happened?
Thank you.
Regards,
Yun
Apart from the test data, why is jobTree dependent on sonLib? Free it from the shackles of unrelated projects!
I just had several jobTrees fail (presumably due to a filesystem problem) with this error:
Batch system is reporting that the job (1, 298079848) /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t3/t1/t0/t1/t0/job failed with exit value 256
No log file is present, despite job failing: /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t3/t1/t0/t1/t0/job
Due to failure we are reducing the remaining retry count of job /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t3/t1/t0/t1/t0/job to 3
We have set the default memory of the failed job to 8589934593.0 bytes
Batch system is reporting that the job (1, 298079849) /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t3/t1/t0/t2/t0/job failed with exit value 256
No log file is present, despite job failing: /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t3/t1/t0/t2/t0/job
Due to failure we are reducing the remaining retry count of job /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t3/t1/t0/t2/t0/job to 3
We have set the default memory of the failed job to 8589934593.0 bytes
Batch system is reporting that the job (1, 298079823) /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t2/t3/t1/t2/job failed with exit value 256
There was a .new file for the job and no .updating file /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t2/t3/t1/t2/job
Traceback (most recent call last):
File "/cluster/home/jcarmstr/progressiveCactus/submodules/cactus/bin/cactus_progressive.py", line 239, in <module>
main()
File "/cluster/home/jcarmstr/progressiveCactus/submodules/cactus/progressive/cactus_progressive.py", line 235, in main
Stack(RunCactusPreprocessorThenProgressiveDown(options, args)).startJobTree(options)
File "/cluster/home/jcarmstr/progressiveCactus/submodules/jobTree/scriptTree/stack.py", line 95, in startJobTree
return mainLoop(config, batchSystem)
File "/cluster/home/jcarmstr/progressiveCactus/submodules/jobTree/src/master.py", line 454, in mainLoop
processFinishedJob(jobID, result, updatedJobFiles, jobBatcher, childJobFileToParentJob, childCounts, config)
File "/cluster/home/jcarmstr/progressiveCactus/submodules/jobTree/src/master.py", line 270, in processFinishedJob
job = Job.read(jobFile)
File "/cluster/home/jcarmstr/progressiveCactus/submodules/jobTree/src/job.py", line 39, in read
job = _convertJsonJobToJob(pickler.load(fileHandle))
EOFError: EOF read where object expected
I haven't looked into jobTree internals in depth, but I think this is due to the job writing a completely blank pickle file (the t2/t3/t1/t2/job file has size 0). I'll look to see if we can be resilient to these types of errors and just retry the job if this happens. It wouldn't have helped in this particular case, since all jobs were failing in the same way, but presumably this could also happen if a single job is killed in the wrong way.
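The resilience idea above could be sketched like this: treat an empty or truncated job file as a distinct, retryable failure instead of letting the raw EOFError kill the master. The function and exception names are assumptions, not jobTree's real internals:

```python
# Sketch: guard job deserialization against zero-length or truncated
# pickle files, so the master can requeue the job instead of crashing.
import os
import pickle

class JobReadError(Exception):
    """Raised when a job file is empty or truncated."""

def readJob(jobFile):
    # A zero-byte file means the slave died mid-write; surface a
    # distinct error the caller can translate into a retry.
    if os.path.getsize(jobFile) == 0:
        raise JobReadError("Empty job file: %s" % jobFile)
    with open(jobFile, "rb") as fileHandle:
        try:
            return pickle.load(fileHandle)
        except EOFError:
            raise JobReadError("Truncated job file: %s" % jobFile)
```

As noted, this would not have rescued a run where every job fails the same way, but it turns a single badly-killed job into an ordinary retry.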
The tmp_* directories made during make test
do not need to have full permissions for 'other' users; please consider limiting them as a security precaution.
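One way to get the restrictive permissions requested above is to lean on tempfile.mkdtemp, which already creates directories owner-only; the helper name is illustrative:

```python
# Sketch: create per-run temp directories readable and writable only
# by the owner, rather than relying on the process umask.
import os
import tempfile

def makePrivateTempDir(rootDir=None):
    # mkdtemp creates the directory with mode 0700 by design
    path = tempfile.mkdtemp(prefix="tmp_", dir=rootDir)
    os.chmod(path, 0o700)  # belt and braces on unusual filesystems
    return path
```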
I've installed jobTree on Red Hat Enterprise Linux Server release 6.3 (Santiago) with Python 2.7.3 and SGE 8.1.8, but when I run make test I get some errors:
https://dl.dropboxusercontent.com/u/34969406/test.txt
Can you please help me figure out what's going wrong? Thanks!
This should be removed, even if it duplicates a small amount of library code (the amount is trivial, I think).
When I go to restart one of my jobTree scripts that uses the global and local temp directories, after some of its jobs have finished, I get a crash from the internal jobTree code complaining that some directories do not exist.
Can the code be made to handle those directories being missing?
log.txt: ---JOBTREE SLAVE OUTPUT LOG---
log.txt: Traceback (most recent call last):
log.txt: File "/cluster/home/anovak/.local/lib/python2.7/site-packages/jobTree-1.0-py2.7.egg/jobTree/src/jobTreeSlave.py", line 271, in main
log.txt: defaultMemory=defaultMemory, defaultCpu=defaultCpu, depth=depth)
log.txt: File "/cluster/home/anovak/.local/lib/python2.7/site-packages/jobTree-1.0-py2.7.egg/jobTree/scriptTree/stack.py", line 153, in execute
log.txt: self.target.run()
log.txt: File "/cluster/home/anovak/hive/sgdev/mhc/targets.py", line 244, in run
log.txt: index_dir = sonLib.bioio.getTempFile(rootDir=self.getGlobalTempDir())
log.txt: File "/cluster/home/anovak/.local/lib/python2.7/site-packages/jobTree-1.0-py2.7.egg/jobTree/scriptTree/target.py", line 103, in getGlobalTempDir
log.txt: self.globalTempDir = self.stack.getGlobalTempDir()
log.txt: File "/cluster/home/anovak/.local/lib/python2.7/site-packages/jobTree-1.0-py2.7.egg/jobTree/scriptTree/stack.py", line 129, in getGlobalTempDir
log.txt: return getTempDirectory(rootDir=self.globalTempDir)
log.txt: File "/cluster/home/anovak/.local/lib/python2.7/site-packages/sonLib-1.0-py2.7.egg/sonLib/bioio.py", line 457, in getTempDirectory
log.txt: os.mkdir(rootDir)
log.txt: OSError: [Errno 20] Not a directory: '/cluster/home/anovak/hive/sgdev/mhc/tree7/jobs/t2/t3/t1/t0/gTD0/tmp_Zss3uyl5X6/tmp_45OevDhWor/tmp_vxiVIbzGSw'
log.txt: Exiting the slave because of a failed job on host ku-1-21.local
log.txt: Due to failure we are reducing the remaining retry count of job /cluster/home/anovak/hive/sgdev/mhc/tree7/jobs/t2/t3/t1/t0/job to 0
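A hedged sketch of handling missing temp directories on restart: rebuild the whole chain with makedirs instead of assuming the parent exists. The function name is hypothetical, and note this would not cure the Errno 20 ("Not a directory") case above, where a path component exists as a regular file; that error is deliberately re-raised:

```python
# Sketch: recreate a temp directory chain that vanished between runs,
# tolerating the race where another slave recreates it first.
import errno
import os

def ensureTempDir(path):
    try:
        os.makedirs(path)  # builds any missing intermediate directories
    except OSError as e:
        # EEXIST: another process won the race; anything else (e.g.
        # ENOTDIR, as in the log above) is a real problem.
        if e.errno != errno.EEXIST:
            raise
    return path
```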
The basic make test takes ages... On a singleMachine I see a few dozen processes, all of them doing strictly nothing. Does this mean that some kind of race condition was hit? Or that the test was designed to test me?
Per our conversation, jobTree sometimes loses track of where imports are coming from.
Reporting file: /hive/users/dearl/alignathon/testPSAR/jobTree_flies_reg2_swarm/jobs/tmp_IqnDTgr8uv/tmp_AzbE0HYFWt/tmp_70td82ax1K/log.txt
log.txt: Parsed arguments and set up logging
log.txt: Traceback (most recent call last):
log.txt: File "/cluster/home/dearl/sonTrace/jobTree/bin/jobTreeSlave", line 206, in main
log.txt: loadStack(command).execute(job=job, stats=stats,
log.txt: File "/cluster/home/dearl/sonTrace/jobTree/bin/jobTreeSlave", line 53, in loadStack
log.txt: _temp = __import__(moduleName, globals(), locals(), [className], -1)
log.txt: ImportError: No module named batchPsar
log.txt: Exiting the slave because of a failed job on host kkr18u44.local
log.txt: Finished running the chain of jobs on this node, we ran for a total of 5.177906 seconds
This requires explicitly setting PYTHONPATH when running on swarm; when running on kolossus, simply running from the same directory as the script is sufficient.
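One plausible fix for the import failure above is to have the slave add the target script's own directory to sys.path before importing, so the module resolves on any node regardless of PYTHONPATH. The helper name and call site are assumptions based on the loadStack code in the log:

```python
# Sketch: make the slave's module import independent of the caller's
# PYTHONPATH by prepending the script's directory to sys.path.
import importlib
import os
import sys

def loadModuleForScript(scriptPath, moduleName):
    scriptDir = os.path.dirname(os.path.abspath(scriptPath))
    if scriptDir not in sys.path:
        sys.path.insert(0, scriptDir)
    return importlib.import_module(moduleName)
```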
Courtesy of Tim Sackton:
I ran into another issue with progressiveCactus where a job that is issued with too little memory to finish will lead to a loop of that job getting restarted, failing, and getting restarted again, without the retryCount (apparently) going down at all. So basically we get stuck in a loop where one job is constantly failing.
Am I misunderstanding something? It seems like retryCount should decrement each time the job fails, but you can see from the log here (https://gist.github.com/tsackton/03b1605c4e29762376f2) that the failing job is reissued several times, yet after the second and third failures the retry count is still at 5.
Is this a bug or an error in my code/understanding? It could easily be the latter....
Regardless of whether this is how retries are supposed to work, I was able to get past that error by doubling the memory the retry gets each time a job fails (see here: https://github.com/harvardinformatics/jobTree/blob/master/src/master.py#L79). Ideally I'd also be able to increase the amount of time a job requests, as that would be the other reason to get consistent failures, but I don't see how to do that, or even if it is possible.
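The workaround described above could be sketched as a small retry policy: decrement the count and double the memory request on each failure. The field names are assumptions, and the time doubling is purely speculative, since (as noted) there is no visible way to raise a job's time request upstream:

```python
# Sketch of the retry policy: decrement the retry count and double
# the resource requests each time a job fails.
def onJobFailure(job):
    job["remainingRetryCount"] -= 1
    job["memory"] *= 2  # the linked harvardinformatics workaround
    job["time"] *= 2    # hypothetical: not supported by jobTree
    # True if the job is still worth reissuing
    return job["remainingRetryCount"] > 0
```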
When restarting a jobTree, parameter changes to the script (like changing the batch system) have no effect, because such parameters are overwritten when the state of the earlier jobTree is read. This is poor behaviour and should be fixed; at a minimum, warnings should be issued.
Obviously - though consideration needs to be given to the command line binaries - this could possibly be managed via the Python path, removing the need for the command line binaries.
because I am dumb and this consistently confuses me.
This has been a problem for a while, but I'm just putting an issue up so I remember to fix this somehow.
When parasol has more than a million or so jobs queued, like now, the periodic "parasol -extended list jobs" command that jobTree runs hangs the entire parasol hub process for a couple of minutes while it gets a listing of every job. This sucks, because the cluster nodes start to go idle waiting for work: the hub can't issue new jobs while it's busy sending the list of queued jobs to jobTree. This gets even worse when there are a few jobTrees running; the cluster sometimes sits completely idle for several minutes.
We (read: I) should try to find some way around listing every job, maybe by looking to see if there's a way we can get the same information, but limited to just the jobTree batch rather than all batches. If there isn't a way currently, maybe modify parasol to include that functionality.
It seems possible for long-running jobs (wall clock) to have empty stats.xml files despite use of --stats, perhaps due to infrequent collating of the job statistics.
As the shitty XML way of specifying jobs is dead, the documentation and code should no longer separate out scriptTree.
What does the use of colors, instead of descriptive English sentences, save you in a command line cluster management program? Because it costs your users brain space to store your chromatic cipher.
Furthermore, "dead" is not a color.
Using --maxThreads 1 --maxJobs 1 on swarm with parasol will cause a hang.
If a jobTree crashes, and nobody is watching STDERR, does it really throw an error?
The clunky class interface is too verbose - better to give the examples using the syntax for specifying targets as functions, which avoids the need for an init function.
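To illustrate the ergonomic point above, here is a toy comparison using simplified stand-ins (Target and makeTargetFn here are NOT jobTree's real API; they only mimic the shape of the two styles):

```python
# Illustrative stand-ins only: the verbose class interface versus a
# function-style wrapper that needs no subclass and no __init__.
class Target(object):
    """Class style: every target is a subclass with an __init__."""
    def __init__(self, items):
        self.items = items
    def run(self):
        return sorted(self.items)

def makeTargetFn(fn, args=()):
    """Function style: wrap a plain function as a runnable target."""
    class _FnTarget(object):
        def run(self):
            return fn(*args)
    return _FnTarget()
```

The function style keeps the example code focused on the work being done rather than on class plumbing.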
The following options need to go into the correct jobTree option group, rather than standing on their own:
--logOff
--logInfo
--logDebug
--logLevel
--logFile
--rotatingLogging
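An argparse sketch of the grouping suggested above, so --help lists the logging flags together (the group title, description, and defaults are assumptions):

```python
# Sketch: collect the stray logging flags into one argparse group.
from argparse import ArgumentParser

def addLoggingGroup(parser):
    group = parser.add_argument_group(
        "jobTree logging options",
        "Control how much jobTree logs and where the log goes.")
    group.add_argument("--logOff", action="store_true", default=False)
    group.add_argument("--logInfo", action="store_true", default=False)
    group.add_argument("--logDebug", action="store_true", default=False)
    group.add_argument("--logLevel", default="INFO")
    group.add_argument("--logFile")
    group.add_argument("--rotatingLogging", action="store_true",
                       default=False)
    return group
```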