
cluster_tools's People

Contributors

daniel-wer, dependabot-preview[bot], fm3, georgwiese, jstriebel, normanrz, philippotto

cluster_tools's Issues

Log forwarding

When running tasks, it would be convenient to see the logs of subtasks (both stdout and stderr) directly in the console instead of fishing them out of the Slurm logs. This could be a flag of the executor.
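
A usage sketch of such a flag (forward_logs is a hypothetical name; get_executor is the helper proposed further down in this tracker):

import cluster_tools

def my_task(x):  # stand-in subtask
    return x * 2

# Hypothetical: stream subtask stdout/stderr to the console.
with cluster_tools.get_executor("slurm", forward_logs=True) as executor:
    futures = executor.map_to_futures(my_task, [1, 2, 3])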

clean up argument passing for subjobs

Quoting @striebel:

We should think about an alternative to those all_args, pinging @philippotto.
I also noted this pattern:

args = zip_with_scalar(…)
executor.map(f, args)

Maybe the executor could have direct support for this (e.g. executor.map(func, scalar_arg, *constant_args))?
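
A minimal sketch of what such direct support could look like (map_with_constants is a hypothetical helper; it assumes func takes the constant arguments first):

from functools import partial

def map_with_constants(executor, func, variable_args, *constant_args):
    # Bind the constant arguments once instead of zipping them into every tuple.
    return executor.map(partial(func, *constant_args), variable_args)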

Batch job submission failed: Invalid job array specification

Running a job using cluster_tools==1.39, I get the following (truncated) call stack:

  File "/gaba/u/georgwie/segmentation-tools-spineheads/segmentation_tools/segmentation/abstract_tasks/distributable_task.py", line 152, in execute
    futures = executor.map_to_futures(self.step, [chunk for i, chunk in chunks])
  File "/u/georgwie/conda-envs/segm-3.6_spineheads/lib/python3.6/site-packages/cluster_tools/schedulers/cluster_executor.py", line 250, in map_to_futures
    jobid = self._start(workerid, job_count, job_name)
  File "/u/georgwie/conda-envs/segm-3.6_spineheads/lib/python3.6/site-packages/cluster_tools/schedulers/cluster_executor.py", line 90, in _start
    job_count=job_count,
  File "/u/georgwie/conda-envs/segm-3.6_spineheads/lib/python3.6/site-packages/cluster_tools/schedulers/slurm.py", line 106, in inner_submit
    return self.submit_text("\n".join(script_lines))
  File "/u/georgwie/conda-envs/segm-3.6_spineheads/lib/python3.6/site-packages/cluster_tools/schedulers/slurm.py", line 71, in submit_text
    jobid, _ = chcall("sbatch --parsable {}".format(filename))
  File "/u/georgwie/conda-envs/segm-3.6_spineheads/lib/python3.6/site-packages/cluster_tools/util.py", line 49, in chcall
    raise CommandError(command, code, stderr)
cluster_tools.util.CommandError: 'sbatch --parsable artifacts/default/predict/mhlab_ds2_with_type_spine_head_68_boxes_005__model_10k__l4_spine_heads/cfut/_temp_slurmX2u3Ik2ptVGmQDA5WRu4mzYnfKUZEYCD.sh' exited with status 1: b'sbatch: error: Batch job submission failed: Invalid job array specification\n'

The content of the mentioned bash file is:

#!/bin/sh
#SBATCH --output=artifacts/default/predict/mhlab_ds2_with_type_spine_head_68_boxes_005__model_10k__l4_spine_heads/cfut/slurmpy.stdout.%A_%a.log
#SBATCH --job-name "predict_step"
#SBATCH --array=0-16199
#SBATCH --mem=50G
srun /u/georgwie/conda-envs/segm-3.6_spineheads/bin/python -m cluster_tools.remote Fh3gCXzuMXE1joZxEMly0KdKBM03i97t artifacts/default/predict/mhlab_ds2_with_type_spine_head_68_boxes_005__model_10k__l4_spine_heads/cfut
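
A likely cause (an assumption, not confirmed above): the array specification 0-16199 exceeds the cluster's MaxArraySize limit, which defaults to 1001 in Slurm. The configured limit can be checked with scontrol show config | grep MaxArraySize. If that is the cause, the executor would need to split the submission into several smaller arrays.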

Validate resource requirements

For example, Slurm expects 10G while PBS demands 10GB. Right now, failing to provide this correctly results in a silent error.
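
A minimal validation sketch (hypothetical helper, not part of cluster_tools; the accepted formats are simplified):

import re

# Simplified per-scheduler memory formats: Slurm takes e.g. 10G, PBS e.g. 10gb.
MEM_PATTERNS = {
    "slurm": re.compile(r"\d+[KMGT]?"),
    "pbs": re.compile(r"\d+[kmgt]b"),
}

def validate_memory(scheduler, value):
    if not MEM_PATTERNS[scheduler].fullmatch(value):
        raise ValueError(f"Invalid memory specification {value!r} for {scheduler}")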

White-list keyword arguments

get_executor(environment, *args, **kwargs) fails if keyword arguments are passed that the executor doesn't understand. These arguments should be discarded so that the caller doesn't have to worry about passing exactly the right ones (it would simply pass all arguments that could be relevant).

It might make sense to explicitly white-list the arguments for each executor so that, in the future, we can also translate between different executors (e.g. Slurm and PBS, see #14).

This issue is blocking https://github.com/scalableminds/segmentation-tools/issues/421
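
A sketch of such a white-list (the argument names are assumptions):

# Keyword arguments each executor understands; everything else is dropped.
SLURM_KWARGS = {"job_name", "mem", "time", "partition"}
PBS_KWARGS = {"job_name", "mem", "walltime", "queue"}

def filter_kwargs(allowed, kwargs):
    return {k: v for k, v in kwargs.items() if k in allowed}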

unexpected end of data

I got this exception while running a long-running job on gaba.

The job itself seems to have completed successfully (/tmpscratch/georgwie/artifacts/default/compute_segment_types/segment_types_sorted_probabilities__45f3b33b5c/cfut/slurmpy.stdout.15280248.log). This looks like a problem with streaming the log; I'm not sure what the cause is or whether it happens often.

Traceback (most recent call last):
  File "/u/georgwie/conda-envs/segm-3.6_1/bin/vx", line 11, in <module>
    load_entry_point('voxelytics', 'console_scripts', 'vx')()
  File "/u/georgwie/voxelytics1/voxelytics/__main__.py", line 311, in main
    workflow.run(task_selection, strategy_for_task=strategy_for_task)
  File "/u/georgwie/voxelytics1/voxelytics/commons/workflow/workflow.py", line 255, in run
    execute()
  File "/u/georgwie/voxelytics1/voxelytics/commons/workflow/workflow.py", line 230, in execute
    task.execute()
  File "/u/georgwie/voxelytics1/voxelytics/connect/abstract_tasks/distributable_task.py", line 191, in execute
    executor.forward_log(fut)
  File "/u/georgwie/conda-envs/segm-3.6_1/lib/python3.6/site-packages/cluster_tools/schedulers/cluster_executor.py", line 353, in forward_log
    tailer.follow(2)
  File "/u/georgwie/conda-envs/segm-3.6_1/lib/python3.6/site-packages/cluster_tools/tailf.py", line 33, in follow
    line = file_.readline()
  File "/u/georgwie/conda-envs/segm-3.6_1/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 0: unexpected end of data
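
The byte 0xe2 starts a three-byte UTF-8 sequence (e.g. for characters such as "…"), so the tail reader presumably hit a partially written multi-byte character. A sketch of a follow loop that tolerates this, using an incremental decoder (hypothetical helper, not the current tailf implementation):

import codecs
import time

def follow_utf8(path):
    # The incremental decoder buffers incomplete multi-byte sequences instead
    # of raising "unexpected end of data" while the writer is mid-character.
    decoder = codecs.getincrementaldecoder("utf-8")()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(4096)
            if chunk:
                yield decoder.decode(chunk, final=False)
            else:
                time.sleep(0.5)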

Return Job ID

The Slurm executor's submit() should return a job ID (or it could be a property of the future) so that we can print it if we want to.
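
A usage sketch of the future-property variant (cluster_jobid is a hypothetical attribute name):

fut = executor.submit(my_task, arg)
print(fut.cluster_jobid)  # assumed attribute, not part of the current API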

add executor that uses a for-loop

It would be great to have an executor (e.g. DebugExecutor) that simply wraps a normal for-loop. It should only be used for debugging, but it would be very helpful in such cases. One use case is running tests: dropping into pdb is not possible inside futures, whereas with a plain for-loop the resulting shell is in the context of the caller.
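
A minimal sketch (hypothetical class, not part of cluster_tools):

class DebugExecutor:
    def map(self, fn, *iterables):
        # Eager for-loop: exceptions propagate immediately and pdb breakpoints
        # open in the caller's shell.
        return [fn(*args) for args in zip(*iterables)]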

Restart Failed Jobs

We've seen that on some clusters, jobs can fail in a way where simply retrying them helps. For example, some nodes might not have the necessary libraries installed, and restarting would likely schedule the job on a healthy node.

There are Slurm primitives to implement this (Manuel is currently implementing that for the Matlab pipeline).
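
One such primitive (an assumption about the intended approach): jobs submitted with sbatch --requeue can be put back into the queue after a failure via scontrol requeue <jobid>.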

Add synchronous Executor

We currently have a sequential executor, but it's not the same as running the code synchronously. For example, in the case of an error, the next job is still executed instead of failing immediately.

I think it would be helpful to have an executor that executes its function directly in submit() and returns an already-resolved future.
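
A minimal sketch, using the standard concurrent.futures.Future:

from concurrent.futures import Future

class SynchronousExecutor:
    def submit(self, fn, *args, **kwargs):
        fut = Future()
        try:
            fut.set_result(fn(*args, **kwargs))
        except BaseException as exc:
            # The error surfaces as soon as result() is called on the future.
            fut.set_exception(exc)
        return fut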

Pickle scalars only once

When submitting subjobs, the function and the scalar arguments (which are part of the function due to partial) are pickled once per subjob, even though they are identical for all subjobs. It may be enough to pickle them once and have all subjobs read the shared result.
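
A sketch of the shared-payload idea (the helper names are hypothetical; note that pickling a function requires it to be importable on the worker, or a library such as cloudpickle):

import pickle

def write_shared_payload(path, fn, constant_args):
    # Submission side: one pickle for all subjobs.
    with open(path, "wb") as f:
        pickle.dump((fn, constant_args), f)

def read_shared_payload(path):
    # Worker side: every subjob loads the same file.
    with open(path, "rb") as f:
        return pickle.load(f)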

Unify array job notation

squeue and the log output use <taskid>_<subid>, but the log files are stored as slurmpy.stdout.<taskid>.<subid>.log. It would be great to unify that.

Side note: are slurmpy and .cfut still good naming choices?

Implement `get_executor()` function

We should implement a helper function that can be used by all projects, like get_executor(environment: str) -> Executor.

Environments should include slurm, multiprocessing and sequential. It should also accept optional keyword arguments for each executor type.
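
A minimal sketch (SlurmExecutor and SequentialExecutor are assumed class names; ProcessPoolExecutor is from the standard library):

from concurrent.futures import ProcessPoolExecutor

def get_executor(environment, **kwargs):
    # SlurmExecutor and SequentialExecutor stand in for the corresponding
    # cluster_tools classes.
    if environment == "slurm":
        return SlurmExecutor(**kwargs)
    if environment == "multiprocessing":
        return ProcessPoolExecutor(**kwargs)
    if environment == "sequential":
        return SequentialExecutor(**kwargs)
    raise ValueError("Unknown environment: {}".format(environment))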

Slurm: Don't use 100% of the cluster

When we start large jobs, the cluster is completely full. It would be nice to be able to specify something like "use 80% of the cluster" so that smaller tasks can still be scheduled.

I don't understand Slurm well enough to know how we would specify this. For example, with scontrol update jobid=$JOB arraytaskthrottle=$N you can set a limit on the number of array tasks that run simultaneously.
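
The same throttle can also be set at submission time with the % separator in the array specification, e.g. #SBATCH --array=0-16199%100, which caps the array at 100 simultaneously running tasks. Note that this is a per-job limit rather than a cluster-wide percentage.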

Error Reporting: Check if job is still running

On the Slurm cluster, jobs sometimes get killed by Slurm, e.g. because memory requirements have been exceeded. The Slurm executor does not notice this, and scripts wait indefinitely for the futures to complete.
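
A sketch of a liveness check the executor could poll periodically (the helper is hypothetical; squeue is the real CLI):

import subprocess

def job_is_alive(jobid):
    # squeue -h prints one line per pending/running (array) task and nothing
    # once Slurm has finished or forgotten the job.
    result = subprocess.run(["squeue", "-j", jobid, "-h"], stdout=subprocess.PIPE)
    return bool(result.stdout.strip())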
