benedictpaten / jobTree
Python based pipeline management software for clusters (but check out toil, its successor: https://github.com/BD2KGenomics/toil)
License: MIT License
Commit 76a4328 broke argparse compatibility by failing to maintain the two methods of displaying default values in help strings. The old, optparse method is to use
help="blah blah blah default=%default"
and the argparse method is to use
help="blah blah blah default=%(default)s"
The two interpolation styles are incompatible with one another, which is why we had two methods to handle this.
optparse is deprecated; argparse is its replacement. This can be patched by checking the class of the input parser in the addOptions() function contained in jobTree.src.jobTreeRun, BUT it will also need to be changed in sonLib.bioio addLoggingOptions(). I think the way jobTree and sonLib have used optparse is generic enough that the transition would be transparent to script writers.
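A minimal sketch of the class check described above, dispatching on the parser type so each help string gets the interpolation style its library expects (the helper name and option are illustrative, not jobTree's actual code):

```python
# Sketch: dispatch on the parser class so optparse and argparse each
# get the default-value placeholder they understand.
from argparse import ArgumentParser
from optparse import OptionParser

def addLoggingOptions(parser):
    if isinstance(parser, ArgumentParser):
        # argparse interpolates %(default)s in help strings
        parser.add_argument("--logLevel", default="INFO",
                            help="Set the log level, default=%(default)s")
    elif isinstance(parser, OptionParser):
        # optparse interpolates %default in help strings
        parser.add_option("--logLevel", default="INFO",
                          help="Set the log level, default=%default")
    else:
        raise TypeError("Unsupported parser type: %r" % type(parser))
```

Because the check happens in one place, callers can keep passing whichever parser they already use.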
Hi Benedict,
I got some error messages when I ran "make test":
Starting to create the job tree setup for the first time
Traceback (most recent call last):
File "/projects/gec/tool/assemblathon1/jobTree/bin/scriptTreeTest_Sort.py", line 119, in <module>
main()
File "/gpfs2/projects/gec/tool/assemblathon1/jobTree/test/sort/scriptTreeTest_Sort.py", line 112, in main
i = Stack(Setup(options.fileToSort, int(options.N))).startJobTree(options)
File "/gpfs2/projects/gec/tool/assemblathon1/jobTree/scriptTree/stack.py", line 112, in startJobTree
config, batchSystem = createJobTree(options)
File "/gpfs2/projects/gec/tool/assemblathon1/jobTree/src/jobTreeRun.py", line 221, in createJobTree
batchSystem = loadTheBatchSystem(config)
File "/gpfs2/projects/gec/tool/assemblathon1/jobTree/src/jobTreeRun.py", line 157, in loadTheBatchSystem
batchSystem = GridengineBatchSystem(config)
File "/gpfs2/projects/gec/tool/assemblathon1/jobTree/batchSystems/gridengine.py", line 136, in __init__
self.obtainSystemConstants()
File "/gpfs2/projects/gec/tool/assemblathon1/jobTree/batchSystems/gridengine.py", line 267, in obtainSystemConstants
p = subprocess.Popen(["qhost"], stdout = subprocess.PIPE,stderr = subprocess.STDOUT)
File "/usr/lib64/python2.6/subprocess.py", line 595, in __init__
errread, errwrite)
File "/usr/lib64/python2.6/subprocess.py", line 1106, in _execute_child
raise child_exception
OSError: [Errno 2] No such file or directory
E
======================================================================
ERROR: Uses the jobTreeTest code to test the scriptTree Target wrapper.
----------------------------------------------------------------------
Traceback (most recent call last):
File "/gpfs2/projects/gec/tool/assemblathon1/jobTree/test/scriptTree/scriptTreeTest.py", line 31, in testScriptTree_Example
system(command)
File "/gpfs2/projects/gec/tool/assemblathon1/sonLib/bioio.py", line 160, in system
raise RuntimeError("Command: %s exited with non-zero status %i" % (command, i))
RuntimeError: Command: scriptTreeTest_Wrapper.py --jobTree /gpfs2/projects/gec/tool/assemblathon1/jobTree/tmp_KEwg5sdOhb/jobTree --logLevel=INFO --retryCount=10 exited with non-zero status 256
======================================================================
ERROR: Tests that the global and local temp dirs of a job behave as expected.
----------------------------------------------------------------------
Traceback (most recent call last):
File "/gpfs2/projects/gec/tool/assemblathon1/jobTree/test/scriptTree/scriptTreeTest.py", line 39, in testScriptTree_Example2
system(command)
File "/gpfs2/projects/gec/tool/assemblathon1/sonLib/bioio.py", line 160, in system
raise RuntimeError("Command: %s exited with non-zero status %i" % (command, i))
RuntimeError: Command: scriptTreeTest_Wrapper2.py --jobTree /gpfs2/projects/gec/tool/assemblathon1/jobTree/tmp_GKficygzKn/jobTree --logLevel=INFO --retryCount=0 exited with non-zero status 256
======================================================================
ERROR: Tests the jobTreeStats utility using the scriptTree_sort example.
----------------------------------------------------------------------
Traceback (most recent call last):
File "/gpfs2/projects/gec/tool/assemblathon1/jobTree/test/utilities/statsTest.py", line 35, in testJobTreeStats_SortSimple
system(command)
File "/gpfs2/projects/gec/tool/assemblathon1/sonLib/bioio.py", line 160, in system
raise RuntimeError("Command: %s exited with non-zero status %i" % (command, i))
RuntimeError: Command: scriptTreeTest_Sort.py --jobTree /gpfs2/projects/gec/tool/assemblathon1/jobTree/tmp_rZqK92206Q/jobTree --logLevel=DEBUG --fileToSort=/gpfs2/projects/gec/tool/assemblathon1/jobTree/tmp_rZqK92206Q/tmp_wj73HRVMZx --N 1000 --stats --jobTime 0.5 exited with non-zero status 256
----------------------------------------------------------------------
Ran 14 tests in 318.742s
FAILED (errors=3)
Could I ask what happened?
Thank you.
Regards,
Yun
Apart from the test data, why is jobTree dependent on sonLib? Free it from the shackles of unrelated projects!
I just had several jobTrees fail (presumably due to a filesystem problem) with this error:
Batch system is reporting that the job (1, 298079848) /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t3/t1/t0/t1/t0/job failed with exit value 256
No log file is present, despite job failing: /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t3/t1/t0/t1/t0/job
Due to failure we are reducing the remaining retry count of job /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t3/t1/t0/t1/t0/job to 3
We have set the default memory of the failed job to 8589934593.0 bytes
Batch system is reporting that the job (1, 298079849) /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t3/t1/t0/t2/t0/job failed with exit value 256
No log file is present, despite job failing: /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t3/t1/t0/t2/t0/job
Due to failure we are reducing the remaining retry count of job /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t3/t1/t0/t2/t0/job to 3
We have set the default memory of the failed job to 8589934593.0 bytes
Batch system is reporting that the job (1, 298079823) /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t2/t3/t1/t2/job failed with exit value 256
There was a .new file for the job and no .updating file /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t2/t3/t1/t2/job
Traceback (most recent call last):
File "/cluster/home/jcarmstr/progressiveCactus/submodules/cactus/bin/cactus_progressive.py", line 239, in <module>
main()
File "/cluster/home/jcarmstr/progressiveCactus/submodules/cactus/progressive/cactus_progressive.py", line 235, in main
Stack(RunCactusPreprocessorThenProgressiveDown(options, args)).startJobTree(options)
File "/cluster/home/jcarmstr/progressiveCactus/submodules/jobTree/scriptTree/stack.py", line 95, in startJobTree
return mainLoop(config, batchSystem)
File "/cluster/home/jcarmstr/progressiveCactus/submodules/jobTree/src/master.py", line 454, in mainLoop
processFinishedJob(jobID, result, updatedJobFiles, jobBatcher, childJobFileToParentJob, childCounts, config)
File "/cluster/home/jcarmstr/progressiveCactus/submodules/jobTree/src/master.py", line 270, in processFinishedJob
job = Job.read(jobFile)
File "/cluster/home/jcarmstr/progressiveCactus/submodules/jobTree/src/job.py", line 39, in read
job = _convertJsonJobToJob(pickler.load(fileHandle))
EOFError: EOF read where object expected
I haven't looked into jobTree internals in depth, but I think this is due to the job writing a completely blank pickle file (the t2/t3/t1/t2/job file has size 0). I'll look to see if we can be resilient to these types of errors and just retry the job if this happens. It wouldn't have helped in this particular case, since all jobs were failing in the same way, but presumably this could also happen if a single job is killed in the wrong way.
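The resilience idea above could be sketched like this: treat an empty or truncated job file as a distinct, retryable failure instead of letting the raw EOFError kill the master. The function and exception names are assumptions, not jobTree's real internals:

```python
# Sketch: guard job deserialization against zero-length or truncated
# pickle files, so the master can requeue the job instead of crashing.
import os
import pickle

class JobReadError(Exception):
    """Raised when a job file is empty or truncated."""

def readJob(jobFile):
    # A zero-byte file means the slave died mid-write; surface a
    # distinct error the caller can translate into a retry.
    if os.path.getsize(jobFile) == 0:
        raise JobReadError("Empty job file: %s" % jobFile)
    with open(jobFile, "rb") as fileHandle:
        try:
            return pickle.load(fileHandle)
        except EOFError:
            raise JobReadError("Truncated job file: %s" % jobFile)
```

As noted, this would not have rescued a run where every job fails the same way, but it turns a single badly-killed job into an ordinary retry.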
The tmp_* directories made during make test
do not need to have full permissions for 'other' users; please consider limiting them as a security precaution.
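One way to get the restrictive permissions requested above is to lean on tempfile.mkdtemp, which already creates directories owner-only; the helper name is illustrative:

```python
# Sketch: create per-run temp directories readable and writable only
# by the owner, rather than relying on the process umask.
import os
import tempfile

def makePrivateTempDir(rootDir=None):
    # mkdtemp creates the directory with mode 0700 by design
    path = tempfile.mkdtemp(prefix="tmp_", dir=rootDir)
    os.chmod(path, 0o700)  # belt and braces on unusual filesystems
    return path
```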
I've installed jobTree on Red Hat Enterprise Linux Server release 6.3 (Santiago) with Python 2.7.3 and SGE 8.1.8, but when I run make test I get some errors:
https://dl.dropboxusercontent.com/u/34969406/test.txt
Can you please help me figure out what's going wrong? Thanks!
This should be removed, even if it duplicates a small amount of library code (the amount is trivial, I think).
When I go to restart one of my jobTree scripts that uses the global and local temp directories, after some of its jobs have finished, I get a crash from the internal jobTree code complaining that some directories do not exist.
Can the code be made to handle those directories being missing?
log.txt: ---JOBTREE SLAVE OUTPUT LOG---
log.txt: Traceback (most recent call last):
log.txt: File "/cluster/home/anovak/.local/lib/python2.7/site-packages/jobTree-1.0-py2.7.egg/jobTree/src/jobTreeSlave.py", line 271, in main
log.txt: defaultMemory=defaultMemory, defaultCpu=defaultCpu, depth=depth)
log.txt: File "/cluster/home/anovak/.local/lib/python2.7/site-packages/jobTree-1.0-py2.7.egg/jobTree/scriptTree/stack.py", line 153, in execute
log.txt: self.target.run()
log.txt: File "/cluster/home/anovak/hive/sgdev/mhc/targets.py", line 244, in run
log.txt: index_dir = sonLib.bioio.getTempFile(rootDir=self.getGlobalTempDir())
log.txt: File "/cluster/home/anovak/.local/lib/python2.7/site-packages/jobTree-1.0-py2.7.egg/jobTree/scriptTree/target.py", line 103, in getGlobalTempDir
log.txt: self.globalTempDir = self.stack.getGlobalTempDir()
log.txt: File "/cluster/home/anovak/.local/lib/python2.7/site-packages/jobTree-1.0-py2.7.egg/jobTree/scriptTree/stack.py", line 129, in getGlobalTempDir
log.txt: return getTempDirectory(rootDir=self.globalTempDir)
log.txt: File "/cluster/home/anovak/.local/lib/python2.7/site-packages/sonLib-1.0-py2.7.egg/sonLib/bioio.py", line 457, in getTempDirectory
log.txt: os.mkdir(rootDir)
log.txt: OSError: [Errno 20] Not a directory: '/cluster/home/anovak/hive/sgdev/mhc/tree7/jobs/t2/t3/t1/t0/gTD0/tmp_Zss3uyl5X6/tmp_45OevDhWor/tmp_vxiVIbzGSw'
log.txt: Exiting the slave because of a failed job on host ku-1-21.local
log.txt: Due to failure we are reducing the remaining retry count of job /cluster/home/anovak/hive/sgdev/mhc/tree7/jobs/t2/t3/t1/t0/job to 0
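A hedged sketch of handling missing temp directories on restart: rebuild the whole chain with makedirs instead of assuming the parent exists. The function name is hypothetical, and note this would not cure the Errno 20 ("Not a directory") case above, where a path component exists as a regular file; that error is deliberately re-raised:

```python
# Sketch: recreate a temp directory chain that vanished between runs,
# tolerating the race where another slave recreates it first.
import errno
import os

def ensureTempDir(path):
    try:
        os.makedirs(path)  # builds any missing intermediate directories
    except OSError as e:
        # EEXIST: another process won the race; anything else (e.g.
        # ENOTDIR, as in the log above) is a real problem.
        if e.errno != errno.EEXIST:
            raise
    return path
```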
The basic make test takes ages... On a singleMachine I see a few dozen processes, all of them doing strictly nothing. Does this mean that some kind of race condition was hit? Or that the test was designed to test me?
Per our conversation, jobTree sometimes loses track of where imports are coming from.
Reporting file: /hive/users/dearl/alignathon/testPSAR/jobTree_flies_reg2_swarm/jobs/tmp_IqnDTgr8uv/tmp_AzbE0HYFWt/tmp_70td82ax1K/log.txt
log.txt: Parsed arguments and set up logging
log.txt: Traceback (most recent call last):
log.txt: File "/cluster/home/dearl/sonTrace/jobTree/bin/jobTreeSlave", line 206, in main
log.txt: loadStack(command).execute(job=job, stats=stats,
log.txt: File "/cluster/home/dearl/sonTrace/jobTree/bin/jobTreeSlave", line 53, in loadStack
log.txt: _temp = __import__(moduleName, globals(), locals(), [className], -1)
log.txt: ImportError: No module named batchPsar
log.txt: Exiting the slave because of a failed job on host kkr18u44.local
log.txt: Finished running the chain of jobs on this node, we ran for a total of 5.177906 seconds
This requires explicitly setting PYTHONPATH when running on swarm; when running on kolossus, simply running from the same directory as the script is sufficient.
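One plausible fix for the import failure above is to have the slave add the target script's own directory to sys.path before importing, so the module resolves on any node regardless of PYTHONPATH. The helper name and call site are assumptions based on the loadStack code in the log:

```python
# Sketch: make the slave's module import independent of the caller's
# PYTHONPATH by prepending the script's directory to sys.path.
import importlib
import os
import sys

def loadModuleForScript(scriptPath, moduleName):
    scriptDir = os.path.dirname(os.path.abspath(scriptPath))
    if scriptDir not in sys.path:
        sys.path.insert(0, scriptDir)
    return importlib.import_module(moduleName)
```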
Courtesy of Tim Sackton:
I ran into another issue with progressiveCactus where a job that is issued with too little memory to finish will lead to a loop of that job getting restarted, failing, and getting restarted again, without the retryCount (apparently) going down at all. So basically we get stuck in a loop where one job is constantly failing.
Am I misunderstanding something? It seems like retryCount should decrement each time the job fails, but you can see from the log here (https://gist.github.com/tsackton/03b1605c4e29762376f2) that the failing job is reissued several times, yet after the second and third failures the retry count is still at 5.
Is this a bug or an error in my code/understanding? It could easily be the latter....
Regardless of whether this is how retries are supposed to work, I was able to get past that error by doubling the memory the retry gets each time a job fails (see here: https://github.com/harvardinformatics/jobTree/blob/master/src/master.py#L79). Ideally I'd also be able to increase the amount of time a job requests, as that would be the other reason to get consistent failures, but I don't see how to do that, or even if it is possible.
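The workaround described above could be sketched as a small retry policy: decrement the count and double the memory request on each failure. The field names are assumptions, and the time doubling is purely speculative, since (as noted) there is no visible way to raise a job's time request upstream:

```python
# Sketch of the retry policy: decrement the retry count and double
# the resource requests each time a job fails.
def onJobFailure(job):
    job["remainingRetryCount"] -= 1
    job["memory"] *= 2  # the linked harvardinformatics workaround
    job["time"] *= 2    # hypothetical: not supported by jobTree
    # True if the job is still worth reissuing
    return job["remainingRetryCount"] > 0
```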
When restarting a jobTree, parameter changes to the script (like changing the batch system) have no effect, because such parameters are overwritten when the state of the earlier jobTree is read. This is poor behaviour and should be fixed; at a minimum, warnings should be issued.
Obviously - though consideration needs to be given to the command line binaries - this could possibly be managed via the Python path, removing the need for the command line binaries.
because I am dumb and this consistently confuses me.
This has been a problem for a while, but I'm just putting an issue up so I remember to fix this somehow.
When parasol has more than a million or so jobs queued, like now, the periodic "parasol -extended list jobs" command that jobTree runs hangs the entire parasol hub process for a couple of minutes while it gets a listing of every job. This sucks, because the cluster nodes start to go idle waiting for work: the hub can't issue new jobs while it's busy sending the list of queued jobs to jobTree. This gets even worse when there are a few jobTrees running; the cluster sometimes sits completely idle for several minutes.
We (read: I) should try to find some way around listing every job, maybe by looking to see if there's a way we can get the same information, but limited to just the jobTree batch rather than all batches. If there isn't a way currently, maybe modify parasol to include that functionality.
It seems possible for long-running jobs (wall clock) to have empty stats.xml files despite use of --stats, perhaps due to infrequent collating of the job statistics.
As the shitty XML way of specifying jobs is dead, the documentation and code should no longer separate out scriptTree.
What does the use of colors, instead of descriptive English sentences, save you in a command line cluster management program? Because it costs your users brain space to store your chromatic cipher.
Furthermore, "dead" is not a color.
Using --maxThreads 1 --maxJobs 1 on swarm with parasol will cause a hang.
If a jobTree crashes, and nobody is watching STDERR, does it really throw an error?
The clunky class interface is too verbose - better to give the examples using the syntax for specifying targets as functions, which avoids the need for an init function.
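To illustrate the ergonomic point above, here is a toy comparison using simplified stand-ins (Target and makeTargetFn here are NOT jobTree's real API; they only mimic the shape of the two styles):

```python
# Illustrative stand-ins only: the verbose class interface versus a
# function-style wrapper that needs no subclass and no __init__.
class Target(object):
    """Class style: every target is a subclass with an __init__."""
    def __init__(self, items):
        self.items = items
    def run(self):
        return sorted(self.items)

def makeTargetFn(fn, args=()):
    """Function style: wrap a plain function as a runnable target."""
    class _FnTarget(object):
        def run(self):
            return fn(*args)
    return _FnTarget()
```

The function style keeps the example code focused on the work being done rather than on class plumbing.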
The following options need to go into the correct jobTree option group, rather than standing on their own:
--logOff
--logInfo
--logDebug
--logLevel
--logFile
--rotatingLogging
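An argparse sketch of the grouping suggested above, so --help lists the logging flags together (the group title, description, and defaults are assumptions):

```python
# Sketch: collect the stray logging flags into one argparse group.
from argparse import ArgumentParser

def addLoggingGroup(parser):
    group = parser.add_argument_group(
        "jobTree logging options",
        "Control how much jobTree logs and where the log goes.")
    group.add_argument("--logOff", action="store_true", default=False)
    group.add_argument("--logInfo", action="store_true", default=False)
    group.add_argument("--logDebug", action="store_true", default=False)
    group.add_argument("--logLevel", default="INFO")
    group.add_argument("--logFile")
    group.add_argument("--rotatingLogging", action="store_true",
                       default=False)
    return group
```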