scalableminds / cluster_tools
Task distribution with Slurm and multi-processing for Python
License: MIT License
When running tasks, it would be convenient to see the logs of subtasks (both stdout and stderr) directly in the console instead of fishing them out of the Slurm logs. This could be a flag of the executor.
The code that calls `executor.submit()` should not know which executor is used. The `SlurmExecutor` currently accepts `additional_setup_lines` and `job_resources` arguments. We should move these to the constructor.
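A sketch of the resulting call-site contract, using a minimal stub in place of the real `SlurmExecutor` (the stub runs functions inline purely for illustration):

```python
from concurrent.futures import Future

class SlurmExecutor:
    """Minimal stub sketching the proposed interface (hypothetical)."""

    def __init__(self, job_resources=None, additional_setup_lines=None):
        # Slurm-specific settings move to the constructor, so code that
        # calls submit() stays executor-agnostic.
        self.job_resources = job_resources or {}
        self.additional_setup_lines = additional_setup_lines or []

    def submit(self, fn, *args, **kwargs):
        # The real executor would write an sbatch script here; the stub
        # just runs the function to illustrate the call-site contract.
        future = Future()
        future.set_result(fn(*args, **kwargs))
        return future

executor = SlurmExecutor(job_resources={"mem": "50G"})
print(executor.submit(lambda x: x + 1, 41).result())  # 42
```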
This allows inspecting the sbatch script of running jobs.
Quoting @striebel:
We should think about an alternative to those all_args, pinging @philippotto.
Also, I noticed this pattern:
`args = zip_with_scalar(…)`
`executor.map(f, args)`
Maybe the executor could have direct support for this, e.g. `executor.map(func, scalar_arg, *constant_args)`?
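One way such direct support could look, sketched with a hypothetical `map_with_constants` helper that binds the constant arguments once via `functools.partial` (a thread pool stands in for the cluster executor):

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial

def map_with_constants(executor, func, iterable, *constant_args):
    # Hypothetical helper: bind constant arguments once instead of
    # zipping them with every element via zip_with_scalar.
    return executor.map(partial(func, *constant_args), iterable)

def scale(factor, x):
    return factor * x

with ThreadPoolExecutor() as pool:
    results = list(map_with_constants(pool, scale, [1, 2, 3], 10))
print(results)  # [10, 20, 30]
```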
Running a job using `cluster_tools==1.39`, I get the following (truncated) call stack:
File "/gaba/u/georgwie/segmentation-tools-spineheads/segmentation_tools/segmentation/abstract_tasks/distributable_task.py", line 152, in execute
futures = executor.map_to_futures(self.step, [chunk for i, chunk in chunks])
File "/u/georgwie/conda-envs/segm-3.6_spineheads/lib/python3.6/site-packages/cluster_tools/schedulers/cluster_executor.py", line 250, in map_to_futures
jobid = self._start(workerid, job_count, job_name)
File "/u/georgwie/conda-envs/segm-3.6_spineheads/lib/python3.6/site-packages/cluster_tools/schedulers/cluster_executor.py", line 90, in _start
job_count=job_count,
File "/u/georgwie/conda-envs/segm-3.6_spineheads/lib/python3.6/site-packages/cluster_tools/schedulers/slurm.py", line 106, in inner_submit
return self.submit_text("\n".join(script_lines))
File "/u/georgwie/conda-envs/segm-3.6_spineheads/lib/python3.6/site-packages/cluster_tools/schedulers/slurm.py", line 71, in submit_text
jobid, _ = chcall("sbatch --parsable {}".format(filename))
File "/u/georgwie/conda-envs/segm-3.6_spineheads/lib/python3.6/site-packages/cluster_tools/util.py", line 49, in chcall
raise CommandError(command, code, stderr)
cluster_tools.util.CommandError: 'sbatch --parsable artifacts/default/predict/mhlab_ds2_with_type_spine_head_68_boxes_005__model_10k__l4_spine_heads/cfut/_temp_slurmX2u3Ik2ptVGmQDA5WRu4mzYnfKUZEYCD.sh' exited with status 1: b'sbatch: error: Batch job submission failed: Invalid job array specification\n'
The content of the mentioned bash file is:
#!/bin/sh
#SBATCH --output=artifacts/default/predict/mhlab_ds2_with_type_spine_head_68_boxes_005__model_10k__l4_spine_heads/cfut/slurmpy.stdout.%A_%a.log
#SBATCH --job-name "predict_step"
#SBATCH --array=0-16199
#SBATCH --mem=50G
srun /u/georgwie/conda-envs/segm-3.6_spineheads/bin/python -m cluster_tools.remote Fh3gCXzuMXE1joZxEMly0KdKBM03i97t artifacts/default/predict/mhlab_ds2_with_type_spine_head_68_boxes_005__model_10k__l4_spine_heads/cfut
Useful in conjunction with #6
For example, Slurm expects `10G` while PBS demands `10GB`. Right now, failing to provide this correctly will result in a silent error.
`get_executor(environment, *args, **kwargs)` fails if keyword args are passed that the executor doesn't understand. These arguments should be discarded so that the caller doesn't have to worry about passing the right arguments (it would just pass all arguments that could be relevant).
It might make sense to explicitly white-list arguments for each executor so that in the future, we can also do a translation between different executors (e.g. slurm and PBS (#14)).
This issue is blocking https://github.com/scalableminds/segmentation-tools/issues/421
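A minimal sketch of the discarding/whitelisting idea: `filter_kwargs` is a hypothetical helper that keeps only the keyword arguments an executor's constructor accepts (a stub class stands in for the real executor):

```python
import inspect

def filter_kwargs(executor_cls, kwargs):
    # Hypothetical helper: keep only keyword arguments the executor's
    # constructor actually accepts, silently discarding the rest.
    accepted = inspect.signature(executor_cls.__init__).parameters
    return {k: v for k, v in kwargs.items() if k in accepted}

class SlurmExecutor:  # stub standing in for the real class
    def __init__(self, job_resources=None):
        self.job_resources = job_resources

kwargs = {"job_resources": {"mem": "10G"}, "pbs_only_option": True}
executor = SlurmExecutor(**filter_kwargs(SlurmExecutor, kwargs))
print(executor.job_resources)  # {'mem': '10G'}
```

An explicit per-executor whitelist (as suggested above) would additionally allow translating argument names and units between schedulers.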
I've had this exception running a long-running job on gaba. The job seems to have completed successfully (`/tmpscratch/georgwie/artifacts/default/compute_segment_types/segment_types_sorted_probabilities__45f3b33b5c/cfut/slurmpy.stdout.15280248.log`). It seems like this is a problem with streaming the log. Not sure what the problem could be or whether this happens often.
Traceback (most recent call last):
File "/u/georgwie/conda-envs/segm-3.6_1/bin/vx", line 11, in <module>
load_entry_point('voxelytics', 'console_scripts', 'vx')()
File "/u/georgwie/voxelytics1/voxelytics/__main__.py", line 311, in main
workflow.run(task_selection, strategy_for_task=strategy_for_task)
File "/u/georgwie/voxelytics1/voxelytics/commons/workflow/workflow.py", line 255, in run
execute()
File "/u/georgwie/voxelytics1/voxelytics/commons/workflow/workflow.py", line 230, in execute
task.execute()
File "/u/georgwie/voxelytics1/voxelytics/connect/abstract_tasks/distributable_task.py", line 191, in execute
executor.forward_log(fut)
File "/u/georgwie/conda-envs/segm-3.6_1/lib/python3.6/site-packages/cluster_tools/schedulers/cluster_executor.py", line 353, in forward_log
tailer.follow(2)
File "/u/georgwie/conda-envs/segm-3.6_1/lib/python3.6/site-packages/cluster_tools/tailf.py", line 33, in follow
line = file_.readline()
File "/u/georgwie/conda-envs/segm-3.6_1/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 0: unexpected end of data
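The error likely happens because a read can end mid-way through a multi-byte UTF-8 character when tailing a file that is still being written (`0xe2` starts a three-byte sequence). A sketch of a tolerant approach using an incremental decoder, with `follow_bytes` as a hypothetical stand-in for the tailer's read loop:

```python
import codecs

def follow_bytes(chunks):
    # Hypothetical sketch: decode a tailed log incrementally, so a
    # multi-byte UTF-8 character split across reads does not raise.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text

# b"\xe2\x9c\x93" is one character split across two reads,
# as can happen when tailing a live file.
print("".join(follow_bytes([b"ok \xe2", b"\x9c\x93\n"])))
```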
The Slurm executor's `submit()` should return a job ID (or maybe it can be a property of the future), so that we can print it if we want to.
We see `<unknown function>` in squeue; maybe the name of the partial function can be used?
When trying to schedule too many tasks, Slurm's configuration can make the submission fail. To avoid hitting the limits, the tasks can be chunked accordingly. See #103 for a reference implementation.
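A minimal sketch of the chunking idea; `chunk_ranges` is a hypothetical helper, and the actual limit would come from the cluster's `MaxArraySize` (this is presumably what made the `--array=0-16199` submission above fail):

```python
def chunk_ranges(total, max_array_size):
    # Hypothetical helper: split N tasks into array submissions that
    # each stay within the cluster's MaxArraySize limit.
    return [
        (start, min(start + max_array_size, total) - 1)
        for start in range(0, total, max_array_size)
    ]

# 16200 tasks with a limit of 1000 per array -> 17 submissions
print(chunk_ranges(16200, 1000)[:2])   # [(0, 999), (1000, 1999)]
print(len(chunk_ranges(16200, 1000)))  # 17
```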
We could wrap/subclass the future objects and intercept calls to `result()` & co. so that we know whether the call site has checked for errors. If not (and errors exist), we could output a warning.
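One possible sketch, assuming a hypothetical `CheckedFuture` wrapper that warns when a failed future is discarded without anyone having inspected its error:

```python
import warnings
from concurrent.futures import Future

class CheckedFuture:
    """Hypothetical wrapper: warn if a failed future is garbage-collected
    without anyone having called result() or exception()."""

    def __init__(self, future):
        self._future = future
        self._checked = False

    def result(self, timeout=None):
        self._checked = True
        return self._future.result(timeout)

    def exception(self, timeout=None):
        self._checked = True
        return self._future.exception(timeout)

    def __del__(self):
        if not self._checked and self._future.done() and self._future.exception():
            warnings.warn("A failed future was never checked for errors")

f = Future()
f.set_exception(RuntimeError("boom"))
checked = CheckedFuture(f)
print(type(checked.exception()).__name__)  # RuntimeError
```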
The resumable executor of voxelytics checks what needs to be re-run. If that list becomes empty, `map_to_futures` crashes.
Use `logger.debug` instead of `print` plus a debug flag.
This occurred in multiple stages of the alignment pipeline; the job log files looked as though the jobs completed normally. Since this crashes the main program, I'd consider it fairly important.
It would be great to have an executor (e.g. `DebugExecutor`) that just wraps a normal for-loop. It should only be used for debugging purposes, but it would be very helpful in such cases. One use case: when running tests, dropping into pdb is not possible inside futures; with a plain loop, the resulting shell is in the context of the caller.
The slurm executor's constructor should accept a task name argument and name jobs accordingly.
We've experienced that on some clusters, jobs might fail in a way where simply retrying them would help. For example, some nodes might not have the necessary libraries installed, and restarting would likely schedule the job on a healthy node.
There are Slurm primitives to implement this (Manuel is currently implementing that on the Matlab Pipeline).
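Independent of the Slurm primitives, a generic retry wrapper could look like this (`submit_with_retries` is a hypothetical helper, not part of cluster_tools):

```python
import time

def submit_with_retries(submit_fn, *args, max_retries=3, delay=1.0):
    # Hypothetical sketch: retry transient failures (e.g. a job landing
    # on a misconfigured node) before giving up for good.
    for attempt in range(max_retries):
        try:
            return submit_fn(*args)
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(delay)

attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("node without libraries")
    return "done"

print(submit_with_retries(flaky, max_retries=5, delay=0))  # done
```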
If the provided cfut dir is a symlink, the directory should be created at its target.
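A minimal sketch, assuming a hypothetical `ensure_cfut_dir` helper that resolves the symlink before creating the directory:

```python
import os
import tempfile

def ensure_cfut_dir(path):
    # Hypothetical sketch: resolve symlinks so the directory is created
    # at the link's target rather than shadowing the link itself.
    real = os.path.realpath(path)
    os.makedirs(real, exist_ok=True)
    return real

# Demo: a symlink pointing at a not-yet-existing target directory
base = tempfile.mkdtemp()
link = os.path.join(base, "link")
os.symlink(os.path.join(base, "target"), link)
print(os.path.isdir(ensure_cfut_dir(link)))  # True
```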
We currently have a sequential executor, but it's not the same as running the code synchronously. For example, in the case of an error, the next job is executed instead of failing immediately. I think it would be helpful to have an executor which executes its function directly on `submit()` and returns a resolved future.
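A minimal sketch of such an executor (the class name `SynchronousExecutor` is an assumption); by not catching exceptions in `submit()`, errors propagate immediately at the call site:

```python
from concurrent.futures import Future

class SynchronousExecutor:
    # Hypothetical sketch: run the function right inside submit() and
    # return an already-resolved future; errors fail immediately instead
    # of being deferred to result().
    def submit(self, fn, *args, **kwargs):
        future = Future()
        future.set_result(fn(*args, **kwargs))  # raises here on error
        return future

executor = SynchronousExecutor()
print(executor.submit(lambda: 6 * 7).result())  # 42
```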
When submitting subjobs, the function and scalar arguments (now part of the function due to `partial`) are pickled once per subjob, even though all of them are identical. Maybe it is enough to pickle them once and have all subjobs read it.
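A sketch of the pickle-once idea: the (partial) function is serialized to a single shared file and each subjob loads the same blob (`operator.mul` stands in for the user function):

```python
import operator
import os
import pickle
import tempfile
from functools import partial

# Serialize the partial function ONCE to a shared location...
func = partial(operator.mul, 3)
path = os.path.join(tempfile.mkdtemp(), "function.pkl")
with open(path, "wb") as f:
    pickle.dump(func, f)

# ...and in each subjob, load the same blob instead of re-pickling:
with open(path, "rb") as f:
    loaded = pickle.load(f)
print(loaded(14))  # 42
```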
squeue and log prints use `<taskid>_<subid>`, but the log files are stored as `slurmpy.stdout.<taskid>.<subid>.log`. It would be great to unify that. Side note: are `slurmpy` and `.cfut` still good naming choices?
We should implement a helper function that can be used by all projects, like `get_executor(environment: str) -> Executor`. Environments should include `slurm`, `multiprocessing`, and `sequential`. It should also accept optional keyword arguments for each executor type.
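A sketch of the proposed helper, with stub classes standing in for the real executors:

```python
class _StubExecutor:
    """Stand-in for the real executor classes."""
    def __init__(self, **kwargs):
        self.kwargs = kwargs

class SlurmExecutor(_StubExecutor): pass
class MultiprocessingExecutor(_StubExecutor): pass
class SequentialExecutor(_StubExecutor): pass

EXECUTORS = {
    "slurm": SlurmExecutor,
    "multiprocessing": MultiprocessingExecutor,
    "sequential": SequentialExecutor,
}

def get_executor(environment, **kwargs):
    # Hypothetical helper: map an environment name to an executor class
    # and forward the per-executor keyword arguments.
    try:
        cls = EXECUTORS[environment]
    except KeyError:
        raise ValueError(f"unknown environment: {environment!r}")
    return cls(**kwargs)

print(type(get_executor("slurm", job_resources={"mem": "10G"})).__name__)  # SlurmExecutor
```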
E.g., `function.__module__ + function.__name__`.
When we start large jobs, the cluster is completely full. It would be nice to be able to specify something like "Use 80% of the cluster" so that smaller tasks are still scheduled.
I don't know Slurm well enough to say how we would specify this. For example, with `scontrol update jobid=$JOB arraytaskthrottle=$N`, you can set a limit on the number of array tasks run simultaneously.
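sbatch also supports a throttle directly in the array specification via `%N` (e.g. `--array=0-999%80` runs at most 80 tasks at once). A hypothetical helper that derives such a spec from a desired cluster share:

```python
def array_spec(task_count, cluster_share, cluster_size):
    # Hypothetical sketch: build a Slurm --array value with a "%" throttle,
    # so only a fraction of the cluster runs our tasks at once.
    throttle = max(1, int(cluster_share * cluster_size))
    return f"0-{task_count - 1}%{throttle}"

# "Use 80% of a 100-node cluster" for 1000 tasks:
print(array_spec(1000, 0.8, 100))  # 0-999%80
```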
On the Slurm cluster, jobs sometimes get killed by Slurm, e.g. because memory requirements have been exceeded. The Slurm executor does not realize this, and scripts wait indefinitely for the futures to complete.
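One possible approach is to poll the scheduler (e.g. `sacct -j <jobid> -o State`) and complete waiting futures with an exception once a terminal failure state is reported. A minimal sketch of the state classification (the state names follow Slurm's job state codes; the helper is hypothetical):

```python
# Terminal states reported by Slurm, after which a job will never complete
# normally, so waiting futures should be failed instead of left hanging.
TERMINAL_FAILURES = {"FAILED", "CANCELLED", "TIMEOUT", "NODE_FAIL", "OUT_OF_MEMORY"}

def futures_should_fail(job_states):
    # Hypothetical sketch: for each polled job state, decide whether the
    # corresponding future should be completed with an exception.
    return [state in TERMINAL_FAILURES for state in job_states]

print(futures_should_fail(["COMPLETED", "OUT_OF_MEMORY"]))  # [False, True]
```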