
cluster_tools's People

Contributors

daniel-wer, dependabot-preview[bot], fm3, georgwiese, jstriebel, normanrz, philippotto

cluster_tools's Issues

Log forwarding

When running tasks, it would be convenient to see the logs of subtasks (both stdout and stderr) directly in the console instead of fishing them out of the Slurm logs. This could be a flag of the executor.
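
A usage sketch of such a flag (forward_logs is a hypothetical name; get_executor is the helper proposed further down in this tracker):

import cluster_tools

def my_task(x):  # stand-in subtask
    return x * 2

# Hypothetical: stream subtask stdout/stderr to the console.
with cluster_tools.get_executor("slurm", forward_logs=True) as executor:
    futures = executor.map_to_futures(my_task, [1, 2, 3])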

clean up argument passing for subjobs

Quoting @striebel:

We should think about an alternative to those all_args, pinging @philippotto.
I also noted this pattern:

args = zip_with_scalar(…)
executor.map(f, args)

Maybe the executor could have direct support for this (e.g. executor.map(func, scalar_arg, *constant_args))?
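
A minimal sketch of what such direct support could look like (map_with_constants is a hypothetical helper; it assumes func takes the constant arguments first):

from functools import partial

def map_with_constants(executor, func, variable_args, *constant_args):
    # Bind the constant arguments once instead of zipping them into every tuple.
    return executor.map(partial(func, *constant_args), variable_args)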

Batch job submission failed: Invalid job array specification

Running a job using cluster_tools==1.39, I get the following (truncated) call stack:

  File "/gaba/u/georgwie/segmentation-tools-spineheads/segmentation_tools/segmentation/abstract_tasks/distributable_task.py", line 152, in execute
    futures = executor.map_to_futures(self.step, [chunk for i, chunk in chunks])
  File "/u/georgwie/conda-envs/segm-3.6_spineheads/lib/python3.6/site-packages/cluster_tools/schedulers/cluster_executor.py", line 250, in map_to_futures
    jobid = self._start(workerid, job_count, job_name)
  File "/u/georgwie/conda-envs/segm-3.6_spineheads/lib/python3.6/site-packages/cluster_tools/schedulers/cluster_executor.py", line 90, in _start
    job_count=job_count,
  File "/u/georgwie/conda-envs/segm-3.6_spineheads/lib/python3.6/site-packages/cluster_tools/schedulers/slurm.py", line 106, in inner_submit
    return self.submit_text("\n".join(script_lines))
  File "/u/georgwie/conda-envs/segm-3.6_spineheads/lib/python3.6/site-packages/cluster_tools/schedulers/slurm.py", line 71, in submit_text
    jobid, _ = chcall("sbatch --parsable {}".format(filename))
  File "/u/georgwie/conda-envs/segm-3.6_spineheads/lib/python3.6/site-packages/cluster_tools/util.py", line 49, in chcall
    raise CommandError(command, code, stderr)
cluster_tools.util.CommandError: 'sbatch --parsable artifacts/default/predict/mhlab_ds2_with_type_spine_head_68_boxes_005__model_10k__l4_spine_heads/cfut/_temp_slurmX2u3Ik2ptVGmQDA5WRu4mzYnfKUZEYCD.sh' exited with status 1: b'sbatch: error: Batch job submission failed: Invalid job array specification\n'

The content of the mentioned bash file is:

#!/bin/sh
#SBATCH --output=artifacts/default/predict/mhlab_ds2_with_type_spine_head_68_boxes_005__model_10k__l4_spine_heads/cfut/slurmpy.stdout.%A_%a.log
#SBATCH --job-name "predict_step"
#SBATCH --array=0-16199
#SBATCH --mem=50G
srun /u/georgwie/conda-envs/segm-3.6_spineheads/bin/python -m cluster_tools.remote Fh3gCXzuMXE1joZxEMly0KdKBM03i97t artifacts/default/predict/mhlab_ds2_with_type_spine_head_68_boxes_005__model_10k__l4_spine_heads/cfut
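
A likely cause (an assumption, not confirmed above): the array specification 0-16199 exceeds the cluster's MaxArraySize limit, which defaults to 1001 in Slurm. The configured limit can be checked with scontrol show config | grep MaxArraySize. If that is the cause, the executor would need to split the submission into several smaller arrays.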

Validate resource requirements

For example, Slurm expects 10G while PBS demands 10GB. Right now, failing to provide this correctly results in a silent error.
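
A minimal validation sketch (hypothetical helper, not part of cluster_tools; the accepted formats are simplified):

import re

# Simplified per-scheduler memory formats: Slurm takes e.g. 10G, PBS e.g. 10gb.
MEM_PATTERNS = {
    "slurm": re.compile(r"\d+[KMGT]?"),
    "pbs": re.compile(r"\d+[kmgt]b"),
}

def validate_memory(scheduler, value):
    if not MEM_PATTERNS[scheduler].fullmatch(value):
        raise ValueError(f"Invalid memory specification {value!r} for {scheduler}")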

White-list keyword arguments

get_executor(environment, *args, **kwargs) fails if keyword arguments are passed that the executor doesn't understand. These arguments should be discarded so that the caller doesn't have to worry about passing exactly the right ones (it would simply pass all arguments that could be relevant).

It might make sense to explicitly white-list the arguments for each executor so that, in the future, we can also translate between different executors (e.g. Slurm and PBS, see #14).

This issue is blocking https://github.com/scalableminds/segmentation-tools/issues/421
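
A sketch of such a white-list (the argument names are assumptions):

# Keyword arguments each executor understands; everything else is dropped.
SLURM_KWARGS = {"job_name", "mem", "time", "partition"}
PBS_KWARGS = {"job_name", "mem", "walltime", "queue"}

def filter_kwargs(allowed, kwargs):
    return {k: v for k, v in kwargs.items() if k in allowed}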

unexpected end of data

I got this exception while running a long-running job on gaba.

The job itself seems to have completed successfully (/tmpscratch/georgwie/artifacts/default/compute_segment_types/segment_types_sorted_probabilities__45f3b33b5c/cfut/slurmpy.stdout.15280248.log). This looks like a problem with streaming the log; I'm not sure what the cause is or whether it happens often.

Traceback (most recent call last):
  File "/u/georgwie/conda-envs/segm-3.6_1/bin/vx", line 11, in <module>
    load_entry_point('voxelytics', 'console_scripts', 'vx')()
  File "/u/georgwie/voxelytics1/voxelytics/__main__.py", line 311, in main
    workflow.run(task_selection, strategy_for_task=strategy_for_task)
  File "/u/georgwie/voxelytics1/voxelytics/commons/workflow/workflow.py", line 255, in run
    execute()
  File "/u/georgwie/voxelytics1/voxelytics/commons/workflow/workflow.py", line 230, in execute
    task.execute()
  File "/u/georgwie/voxelytics1/voxelytics/connect/abstract_tasks/distributable_task.py", line 191, in execute
    executor.forward_log(fut)
  File "/u/georgwie/conda-envs/segm-3.6_1/lib/python3.6/site-packages/cluster_tools/schedulers/cluster_executor.py", line 353, in forward_log
    tailer.follow(2)
  File "/u/georgwie/conda-envs/segm-3.6_1/lib/python3.6/site-packages/cluster_tools/tailf.py", line 33, in follow
    line = file_.readline()
  File "/u/georgwie/conda-envs/segm-3.6_1/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 0: unexpected end of data
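
The byte 0xe2 starts a three-byte UTF-8 sequence (e.g. for characters such as "…"), so the tail reader presumably hit a partially written multi-byte character. A sketch of a follow loop that tolerates this, using an incremental decoder (hypothetical helper, not the current tailf implementation):

import codecs
import time

def follow_utf8(path):
    # The incremental decoder buffers incomplete multi-byte sequences instead
    # of raising "unexpected end of data" while the writer is mid-character.
    decoder = codecs.getincrementaldecoder("utf-8")()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(4096)
            if chunk:
                yield decoder.decode(chunk, final=False)
            else:
                time.sleep(0.5)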

Return Job ID

The Slurm executor's submit() should return a job ID (or it could be a property of the future) so that we can print it if we want to.
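
A usage sketch of the future-property variant (cluster_jobid is a hypothetical attribute name):

fut = executor.submit(my_task, arg)
print(fut.cluster_jobid)  # assumed attribute, not part of the current API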

add executor that uses a for-loop

It would be great to have an executor (e.g. DebugExecutor) that simply wraps a normal for-loop. It should only be used for debugging, but it would be very helpful in such cases. One use case is running tests: dropping into pdb is not possible inside futures, whereas with a plain for-loop the resulting shell is in the context of the caller.
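
A minimal sketch (hypothetical class, not part of cluster_tools):

class DebugExecutor:
    def map(self, fn, *iterables):
        # Eager for-loop: exceptions propagate immediately and pdb breakpoints
        # open in the caller's shell.
        return [fn(*args) for args in zip(*iterables)]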

Restart Failed Jobs

We've seen that on some clusters, jobs can fail in a way where simply retrying them helps. For example, some nodes might not have the necessary libraries installed, and restarting would likely schedule the job on a healthy node.

There are Slurm primitives to implement this (Manuel is currently implementing that for the Matlab pipeline).
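
One such primitive (an assumption about the intended approach): jobs submitted with sbatch --requeue can be put back into the queue after a failure via scontrol requeue <jobid>.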

Add synchronous Executor

We currently have a sequential executor, but it's not the same as running the code synchronously. For example, in the case of an error, the next job is still executed instead of failing immediately.

I think it would be helpful to have an executor that executes its function directly in submit() and returns an already-resolved future.
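
A minimal sketch, using the standard concurrent.futures.Future:

from concurrent.futures import Future

class SynchronousExecutor:
    def submit(self, fn, *args, **kwargs):
        fut = Future()
        try:
            fut.set_result(fn(*args, **kwargs))
        except BaseException as exc:
            # The error surfaces as soon as result() is called on the future.
            fut.set_exception(exc)
        return fut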

Pickle scalars only once

When submitting subjobs, the function and the scalar arguments (which are part of the function due to partial) are pickled once per subjob, even though they are identical for all subjobs. It may be enough to pickle them once and have all subjobs read the shared result.
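
A sketch of the shared-payload idea (the helper names are hypothetical; note that pickling a function requires it to be importable on the worker, or a library such as cloudpickle):

import pickle

def write_shared_payload(path, fn, constant_args):
    # Submission side: one pickle for all subjobs.
    with open(path, "wb") as f:
        pickle.dump((fn, constant_args), f)

def read_shared_payload(path):
    # Worker side: every subjob loads the same file.
    with open(path, "rb") as f:
        return pickle.load(f)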

Unify array job notation

squeue and the log output use <taskid>_<subid>, but the log files are stored as slurmpy.stdout.<taskid>.<subid>.log. It would be great to unify that.

Side note: are slurmpy and .cfut still good naming choices?

Implement `get_executor()` function

We should implement a helper function that can be used by all projects, like get_executor(environment: str) -> Executor.

Environments should include slurm, multiprocessing and sequential. It should also accept optional keyword arguments for each executor type.
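
A minimal sketch (SlurmExecutor and SequentialExecutor are assumed class names; ProcessPoolExecutor is from the standard library):

from concurrent.futures import ProcessPoolExecutor

def get_executor(environment, **kwargs):
    # SlurmExecutor and SequentialExecutor stand in for the corresponding
    # cluster_tools classes.
    if environment == "slurm":
        return SlurmExecutor(**kwargs)
    if environment == "multiprocessing":
        return ProcessPoolExecutor(**kwargs)
    if environment == "sequential":
        return SequentialExecutor(**kwargs)
    raise ValueError("Unknown environment: {}".format(environment))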

Slurm: Don't use 100% of the cluster

When we start large jobs, the cluster is completely full. It would be nice to be able to specify something like "use 80% of the cluster" so that smaller tasks can still be scheduled.

I don't understand Slurm well enough to know how we would specify this. For example, with scontrol update jobid=$JOB arraytaskthrottle=$N you can set a limit on the number of array tasks that run simultaneously.
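
The same throttle can also be set at submission time with the % separator in the array specification, e.g. #SBATCH --array=0-16199%100, which caps the array at 100 simultaneously running tasks. Note that this is a per-job limit rather than a cluster-wide percentage.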

Error Reporting: Check if job is still running

On the Slurm cluster, jobs sometimes get killed by Slurm, e.g. because memory requirements have been exceeded. The Slurm executor does not notice this, and scripts wait indefinitely for the futures to complete.
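
A sketch of a liveness check the executor could poll periodically (the helper is hypothetical; squeue is the real CLI):

import subprocess

def job_is_alive(jobid):
    # squeue -h prints one line per pending/running (array) task and nothing
    # once Slurm has finished or forgotten the job.
    result = subprocess.run(["squeue", "-j", jobid, "-h"], stdout=subprocess.PIPE)
    return bool(result.stdout.strip())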
