
submitit's Introduction


Submit it!

What is submitit?

Submitit is a lightweight tool for submitting Python functions for computation within a Slurm cluster. It basically wraps submission and provides access to results, logs and more. Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Submitit lets you switch seamlessly between executing on Slurm or locally.

An example is worth a thousand words: performing an addition

From inside an environment with submitit installed:

import submitit

def add(a, b):
    return a + b

# executor is the submission interface (logs are dumped in the folder)
executor = submitit.AutoExecutor(folder="log_test")
# set timeout in min, and partition for running the job
executor.update_parameters(timeout_min=1, slurm_partition="dev")
job = executor.submit(add, 5, 7)  # will compute add(5, 7)
print(job.job_id)  # ID of your job

output = job.result()  # waits for completion and returns output
assert output == 12  # 5 + 7 = 12...  your addition was computed in the cluster

The Job class also provides tools for reading the log files (job.stdout() and job.stderr()).

If what you want to run is a command, turn it into a Python function using submitit.helpers.CommandFunction, then submit it. By default stdout is silenced in CommandFunction, but it can be unsilenced with verbose=True.
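
For instance, here is a minimal sketch (the command and Slurm parameters are illustrative; the result is expected to be the command's captured output):

import submitit
from submitit.helpers import CommandFunction

# wrap a shell command so it can be submitted like any Python function
function = CommandFunction(["echo", "hello world"], verbose=True)

executor = submitit.AutoExecutor(folder="log_test")
executor.update_parameters(timeout_min=1, slurm_partition="dev")
job = executor.submit(function)
print(job.result())  # the command's captured stdout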

Find more examples here!

Submitit is a Python 3.8+ toolbox for submitting jobs to Slurm. It aims at running Python functions from Python code.

Install

Quick install, in a virtualenv/conda environment where pip is installed (check which pip):

  • stable release:
    pip install submitit
    
  • stable release using conda:
    conda install -c conda-forge submitit
    
  • main branch:
    pip install git+https://github.com/facebookincubator/submitit@main#egg=submitit
    

You can try running the MNIST example to check that everything is working as expected (requires sklearn).

Documentation

See the following pages for more detailed information:

  • Examples: for a bunch of examples dealing with errors, concurrency, multi-tasking etc.
  • Structure and main objects: to get a better understanding of how submitit works, which files are created for each job, and the main objects you will interact with.
  • Checkpointing: to understand how you can configure your job to get checkpointed when preempted and/or timed-out.
  • Tips and caveats: for a bunch of information that can be handy when working with submitit.
  • Hyperparameter search with nevergrad: basic example of nevergrad usage and how it interfaces with submitit.

Goals

The aim of this Python 3 package is to be able to launch jobs on Slurm painlessly from inside Python, using the same submission and job patterns as the standard library package concurrent.futures.

Here are a few benefits of using this lightweight package:

  • submit any function, even lambdas and script-defined functions.
  • raise an error with the stack trace if the job failed.
  • requeue preempted jobs (Slurm only).
  • swap between the submitit executor and one of the concurrent.futures executors in one line, so that it is easy to run your code either on Slurm, or locally with multithreading for instance (see the sketch after this list).
  • checkpoint stateful callables when preempted or timed-out and requeue from the current state (advanced feature).
  • easy access to task local/global rank for multi-node/multi-task jobs.
  • the same code can work on different clusters thanks to a plugin system.
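
A minimal sketch of this swap (the partition name and job count are illustrative; cluster="local" forces the local backend, as mentioned in one of the issues below):

import submitit

def add(a, b):
    return a + b

# run on Slurm...
executor = submitit.AutoExecutor(folder="log_test")
# ...or locally with the same API, by forcing the local backend:
# executor = submitit.AutoExecutor(folder="log_test", cluster="local")
executor.update_parameters(timeout_min=5, slurm_partition="dev")

jobs = [executor.submit(add, i, 10 * i) for i in range(4)]
print([job.result() for job in jobs])  # [0, 11, 22, 33]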

Submitit is used by FAIR researchers on the FAIR cluster. The defaults are chosen to make their life easier, and might not be ideal for every cluster.

Non-goals

  • a command-line tool for running Slurm jobs: here, everything happens inside Python. For that, you can however use Hydra's submitit plugin (version >= 1.0.0).
  • a task queue, this only implements the ability to launch tasks, but does not schedule them in any way.
  • being used in Python2! This is a Python3.8+ only package :)

Comparison with dask.distributed

dask is a nice framework for distributed computing. dask.distributed provides the same concurrent.futures executor API as submitit:

from distributed import Client
from dask_jobqueue import SLURMCluster
cluster = SLURMCluster(processes=1, cores=2, memory="2GB")
cluster.scale(2)  # this may take a few seconds to launch
executor = Client(cluster)
executor.submit(...)

The key difference with submitit is that dask.distributed distributes the jobs to a pool of workers (see the cluster variable above), while submitit jobs are directly jobs on the cluster. In that sense submitit is a lower-level interface than dask.distributed, and you get more direct control over your jobs, including individual stdout and stderr, and possibly checkpointing in case of preemption and timeout. On the other hand, you should avoid submitting many small tasks with submitit, which would create many independent jobs and possibly overload the cluster, whereas you can do this without any problem through dask.distributed.

Contributors

By chronological order: Jérémy Rapin, Louis Martin, Lowik Chanussot, Lucas Hosseini, Fabio Petroni, Francisco Massa, Guillaume Wenzek, Thibaut Lavril, Vinayak Tantia, Andrea Vedaldi, Max Nickel, Quentin Duval (feel free to contribute and add your name ;) )

License

Submitit is released under the MIT License.


submitit's Issues

Latest release is still using pickle to dump objects.

Despite 1.1.3 having been released after fc70208 landed, it does not contain this fix.
There must have been some mistake with the release.
Can you release an update?

~/dev/hydra/examples/tutorials/structured_configs/4_defaults$ python my_app.py hydra/launcher=submitit_local -m
[2020-11-18 20:42:46,360][HYDRA] Submitit 'local' sweep output dir : multirun/2020-11-18/20-42-46
[2020-11-18 20:42:46,361][HYDRA]        #0 : 
Traceback (most recent call last):
  File "/home/omry/dev/hydra/hydra/_internal/utils.py", line 196, in run_and_report
    return func()
  File "/home/omry/dev/hydra/hydra/_internal/utils.py", line 353, in <lambda>
    lambda: hydra.multirun(
  File "/home/omry/dev/hydra/hydra/_internal/hydra.py", line 137, in multirun
    return sweeper.sweep(arguments=task_overrides)
  File "/home/omry/dev/hydra/hydra/_internal/core_plugins/basic_sweeper.py", line 154, in sweep
    results = self.launcher.launch(batch, initial_job_idx=initial_job_idx)
  File "/home/omry/dev/hydra/plugins/hydra_submitit_launcher/hydra_plugins/hydra_submitit_launcher/submitit_launcher.py", line 150, in launch
    return [j.results()[0] for j in jobs]
  File "/home/omry/dev/hydra/plugins/hydra_submitit_launcher/hydra_plugins/hydra_submitit_launcher/submitit_launcher.py", line 150, in <listcomp>
    return [j.results()[0] for j in jobs]
  File "/home/omry/miniconda3/envs/hydra38/lib/python3.8/site-packages/submitit/core/core.py", line 286, in results
    outcome, result = self._get_outcome_and_result()
  File "/home/omry/miniconda3/envs/hydra38/lib/python3.8/site-packages/submitit/core/core.py", line 365, in _get_outcome_and_result
    raise utils.UncompletedJobError(
submitit.core.utils.UncompletedJobError: Job 32380 (task: 0) with path /home/omry/dev/hydra/examples/tutorials/structured_configs/4_defaults/multirun/2020-11-18/20-42-46/.submitit/32380/32380_0_result.pkl
has not produced any output (state: FINISHED)
Error stream produced:
----------------------
submitit ERROR (2020-11-18 20:42:46,769) - Could not dump error:
Can't pickle <class '__main__.Config'>: attribute lookup Config on __main__ failed

because of A temporary saved file already exists.
submitit ERROR (2020-11-18 20:42:46,769) - Submitted job triggered an exception
Traceback (most recent call last):
  File "/home/omry/miniconda3/envs/hydra38/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/omry/miniconda3/envs/hydra38/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/omry/miniconda3/envs/hydra38/lib/python3.8/site-packages/submitit/core/_submit.py", line 11, in <module>
    submitit_main()
  File "/home/omry/miniconda3/envs/hydra38/lib/python3.8/site-packages/submitit/core/submission.py", line 65, in submitit_main
    process_job(args.folder)
  File "/home/omry/miniconda3/envs/hydra38/lib/python3.8/site-packages/submitit/core/submission.py", line 58, in process_job
    raise error
  File "/home/omry/miniconda3/envs/hydra38/lib/python3.8/site-packages/submitit/core/submission.py", line 49, in process_job
    utils.pickle_dump(("success", result), tmppath)
  File "/home/omry/miniconda3/envs/hydra38/lib/python3.8/site-packages/submitit/core/utils.py", line 278, in pickle_dump
    pickle.dump(obj, ofile, pickle.HIGHEST_PROTOCOL)
_pickle.PicklingError: Can't pickle <class '__main__.Config'>: attribute lookup Config on __main__ failed

~/dev/hydra/examples/tutorials/structured_configs/4_defaults$

After pip installing from github:

~/dev/hydra/examples/tutorials/structured_configs/4_defaults$ python my_app.py hydra/launcher=submitit_local -m
[2020-11-18 20:43:35,456][HYDRA] Submitit 'local' sweep output dir : multirun/2020-11-18/20-43-35
[2020-11-18 20:43:35,457][HYDRA]        #0 : 
~/dev/hydra/examples/tutorials/structured_configs/4_defaults$

Installation error

pip install submitit

Collecting submitit
  Downloading https://files.pythonhosted.org/packages/3d/7a/af94dc7bb279f84346419b1b0a036479878fe97b0e2913369f18d4bf5880/submitit-1.1.1.tar.gz (66kB)
    100% |################################| 71kB 205kB/s 
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-kktwvkje/submitit/setup.py", line 30, in <module>
        long_description=Path("README.md").read_text(),
      File "/usr/lib/python3.6/pathlib.py", line 1197, in read_text
        return f.read()
      File "/usr/lib/python3.6/encodings/ascii.py", line 26, in decode
        return codecs.ascii_decode(input, self.errors)[0]
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 5686: ordinal not in range(128)
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-kktwvkje/submitit/

Fixed it by changing this line in setup.py
long_description=Path("README.md").read_text(),
to
long_description=Path("README.md").read_text(encoding='utf-8'),

Cloudpickle Cannot Pickle Weakref Objects with SQLAlchemy

Hi guys

I'm having an issue when I'm trying to use the SQLAlchemy ORM. I'm seeing if I can put together a minimal example that fails, but I'm not sure exactly what submitit is trying to pickle that is causing the problem. The relevant part of the traceback is:

...
File "/private/home/mehrlich/submitit/submitit/core/core.py", line 606, in submit
    return self._internal_process_submissions([ds])[0]
  File "/private/home/mehrlich/submitit/submitit/slurm/slurm.py", line 314, in _internal_process_submissions
    return super()._internal_process_submissions(delayed_submissions)
  File "/private/home/mehrlich/submitit/submitit/core/core.py", line 726, in _internal_process_submissions
    delayed.dump(pickle_path)
  File "/private/home/mehrlich/submitit/submitit/core/utils.py", line 131, in dump
    cloudpickle_dump(self, filepath)
  File "/private/home/mehrlich/submitit/submitit/core/utils.py", line 283, in cloudpickle_dump
    cloudpickle.dump(obj, ofile, pickle.HIGHEST_PROTOCOL)
...
TypeError: can't pickle weakref objects

and I've narrowed it down to the following class:

class Result(Base):
    __tablename__ = "results"

    model = Column(String, primary_key=True)
    dataset = Column(String, primary_key=True)
    quality = Column(Integer, primary_key=True)
    mitigation = Column(String, primary_key=True)
    metric_key = Column(String)
    metric_value = Column(Float)

    def __repr__(self):
        return f"<Result(model_name='{self.model}', dataset='{self.dataset}', quality='{self.quality}', mitigation='{self.mitigation}', metric_key='{self.metric_key}', metric_value='{self.metric_value}')>"

which is defined at module scope. I can get the job to run under three conditions:

  1. Removing this class definition
  2. Removing the inheritance from Base (which comes from SQLAlchemy)
  3. Removing any references in the function that I'm submitting to the Result class

I also tried moving the class definition from module scope to a function that returns the class which I run from the submitted function and it still fails with the same error.

Do you guys have any idea what might be happening?

Slurm job name with non alphanumeric symbols sanitized to underscores

Submitit is not able to start a job with a name containing non-alphanumeric symbols, e.g. project-subproject:test1. If the user passes such a name, it is changed to project_subproject_test1.

It is possible to configure a Slurm cluster to require users to include non-alphanumeric symbols in a job name. Big clusters require such features for better allocation of resources to different projects.

Reproduction

Create a Python program that runs a job with a colon in its name and run it:

user@cluster-head-node: ./colon_job_executor.py --job_name project-subproject:test1

Expected result
The job project-subproject:test1 is executed on the cluster.

Actual results:

The executed job name is project_subproject_test1

2021-10-21 04:26:02,340 INFO The job project-subproject:test1 executed
sbatch: error: Please set a name for this job, formatted like this:
sbatch: error:  project-<subproject>:*
sbatch: error: Batch job submission failed: Access/permission denied
subprocess.CalledProcessError: Command '['sbatch', './file_job.sh']' returned non-zero exit status 1.

Root cause

The _make_sbatch_string function in slurm.py applies utils.sanitize to job_name without setting the only_alphanum argument to False.

https://github.com/facebookincubator/submitit/blob/main/submitit/slurm/slurm.py#L463

    parameters["signal"] = f"USR1@{signal_delay_s}"
    if job_name:
        parameters["job_name"] = utils.sanitize(job_name) #Line 463 - sanitize replaces "-" and ":" with "_"
    if comment:
        parameters["comment"] = utils.sanitize(comment, only_alphanum=False)
    if num_gpus is not None:
        warnings.warn(
            '"num_gpus" is deprecated, please use "gpus_per_node" instead (overwritting with num_gpus)'
        )
        parameters["gpus_per_node"] = parameters.pop("num_gpus", 0)

Solution

The solution can be changing line 463:

    if job_name:
        parameters["job_name"] = utils.sanitize(job_name, only_alphanum=False) 
    if comment:

Submitit 1.4 fails due to invalid generic resource specification

When I do "pip install submitit" it installs version 1.4 (which doesn't show up in the github release list?). When I run a simple script that works on version 1.2, I get:

Traceback (most recent call last):                                                                                                                                   [0/1833]
  File "submit.py", line 10, in <module>
    job = executor.submit(function)
  File "/private/home/spowers/.conda/envs/venv_conda_init_debug/lib/python3.8/site-packages/submitit/core/core.py", line 663, in submit
    job = self._internal_process_submissions([ds])[0]
  File "/private/home/spowers/.conda/envs/venv_conda_init_debug/lib/python3.8/site-packages/submitit/auto/auto.py", line 218, in _internal_process_submissions
    return self._executor._internal_process_submissions(delayed_submissions)
  File "/private/home/spowers/.conda/envs/venv_conda_init_debug/lib/python3.8/site-packages/submitit/slurm/slurm.py", line 313, in _internal_process_submissions
    return super()._internal_process_submissions(delayed_submissions)
  File "/private/home/spowers/.conda/envs/venv_conda_init_debug/lib/python3.8/site-packages/submitit/core/core.py", line 822, in _internal_process_submissions
    job = self._submit_command(self._submitit_command_str)
  File "/private/home/spowers/.conda/envs/venv_conda_init_debug/lib/python3.8/site-packages/submitit/core/core.py", line 863, in _submit_command
    output = utils.CommandFunction(command_list, verbose=False)()  # explicit errors
  File "/private/home/spowers/.conda/envs/venv_conda_init_debug/lib/python3.8/site-packages/submitit/core/utils.py", line 350, in __call__
    raise FailedJobError(stderr) from subprocess_error
submitit.core.utils.FailedJobError: sbatch: error: Batch job submission failed: Invalid generic resource (gres) specification

If helpful I can post the generated shell script. Here's my executor definition:

executor = submitit.AutoExecutor(folder="tmp/log_test")
executor.update_parameters(timeout_min=1440, slurm_partition="learnlab,learnfair", slurm_gpus_per_task=1, slurm_cpus_per_task=80, slurm_mem_gb=80)
    

Is there a known backwards compatibility issue?

fields() from DataClasses not working properly

Hey, I just wanted to ask about a strange error I have.
fields() from dataclasses does not work properly if the dataclass is declared outside the running function.

Minimalistic example:

DataClass in function

from dataclasses import dataclass, fields

import submitit


def run():
    @dataclass
    class Test:
        test: str = "test"
        u: int = 1

    t = Test()
    print(f"Test: {fields(t)}")
    return 1


executor = submitit.AutoExecutor(folder="log_test")
executor.update_parameters(timeout_min=1, slurm_partition="dev")
job = executor.submit(run)
print(job.result())
print(job.stdout())

Print:

Test: (Field(name='test',type=<class 'str'>,default='test',default_factory=<dataclasses._MISSING_TYPE object at 0x7f5bceda6cd0>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),_field_type=_FIELD), Field(name='u',type=<class 'int'>,default=1,default_factory=<dataclasses._MISSING_TYPE object at 0x7f5bceda6cd0>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),_field_type=_FIELD))

DataClass outside function

from dataclasses import dataclass, fields

import submitit

@dataclass
class Test:
    test: str = "test"
    u: int = 1

def run():
    t = Test()
    print(f"Test: {fields(t)}")
    return 1


executor = submitit.AutoExecutor(folder="log_test")
executor.update_parameters(timeout_min=1, slurm_partition="dev")
job = executor.submit(run)
print(job.result())
print(job.stdout())

Print:

Test: ()

The exact same thing happens on a Slurm cluster with Python 3.7.4 and on my local computer without Slurm with Python 3.9.
Is this a submitit or a threading issue?

Job arrays with more than max # jobs

Curious if submitit has any facilities for handling a list of jobs with more than the specified max # jobs? (e.g. 500). For example, automatically splitting these into separate job arrays.

For example:

import submitit
from numpy.random import randint

def add(a, b):
    return a + b

log_folder = "log_test"  # any log directory
a = randint(100, size=600)
b = randint(100, size=600)

executor = submitit.AutoExecutor(folder=log_folder)
executor.update_parameters(slurm_array_parallelism=2)
jobs = executor.map_array(add, a, b)
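
Continuing the snippet above, one workaround is to split the inputs manually and submit one job array per chunk; the 500-job limit below is a hypothetical value for this cluster:

max_array_size = 500  # hypothetical per-array limit on this cluster
jobs = []
for start in range(0, len(a), max_array_size):
    # each call produces a separate job array below the limit
    jobs += executor.map_array(add, a[start:start + max_array_size], b[start:start + max_array_size])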

Set additional slurm parameters

Hello,

I would like to know if it's possible to set additional slurm parameters (and how to set them), because I couldn't find this information in the documentation.

For example, I have a few arguments that I usually set using srun, such as --account=myaccount --hint=nomultithread --distribution=block:block --exclusive, but I have no idea how to set them in submitit.

Thank you in advance for your answer!
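
A sketch, assuming the slurm_additional_parameters argument of AutoExecutor.update_parameters (it appears in other snippets on this page); the account and partition values are placeholders:

import submitit

executor = submitit.AutoExecutor(folder="log_test")
executor.update_parameters(
    timeout_min=60,
    slurm_partition="dev",
    # extra #SBATCH options not covered by dedicated arguments
    slurm_additional_parameters={
        "account": "myaccount",
        "hint": "nomultithread",
        "distribution": "block:block",
    },
)

Flag-style options such as --exclusive may need a dedicated parameter; checking the generated sbatch script is the safest way to confirm what ends up being submitted.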

bash script without --wckey

My slurm server doesn't recognize --wckey; is there a way to create the bash file without the wckey argument?
I tried passing wckey=None in executor.update_parameters(), but it didn't work.

thanks!

Temporary saved file already exists

Hi,

Thank you for this amazing tool! I just started using it recently. I'm encountering some weird error and I was hoping you could help me fix it. Here is the error log:

submitit WARNING (2021-03-28 01:13:17,420) - Caught signal 15 on learnfair0463: this job is preempted.
slurmstepd: error: *** STEP 38544509.0 ON learnfair0463 CANCELLED AT 2021-03-28T01:13:17 DUE TO JOB REQUEUE ***
slurmstepd: error: *** JOB 38544509 ON learnfair0463 CANCELLED AT 2021-03-28T01:13:17 DUE TO JOB REQUEUE ***
submitit WARNING (2021-03-28 01:13:17,482) - Bypassing signal 18
submitit WARNING (2021-03-28 01:13:17,483) - Caught signal 15 on learnfair0463: this job is preempted.
38544484_16: Job is pending execution
submitit ERROR (2021-03-28 01:13:17,535) - Could not dump error:
Command '['scontrol', 'requeue', '38544484_16']' returned non-zero exit status 1.

because of A temporary saved file already exists.
submitit ERROR (2021-03-28 01:13:17,535) - Submitted job triggered an exception
Traceback (most recent call last):
  File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/_submit.py", line 11, in <module>
    submitit_main()
  File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/submission.py", line 71, in submitit_main
    process_job(args.folder)
  File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/submission.py", line 64, in process_job
    raise error
  File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/submission.py", line 55, in process_job
    utils.cloudpickle_dump(("success", result), tmppath)
  File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/utils.py", line 238, in cloudpickle_dump
    cloudpickle.dump(obj, ofile, pickle.HIGHEST_PROTOCOL)
  File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/job_environment.py", line 209, in checkpoint_and_try_requeue
    self.env._requeue(countdown)
  File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/slurm/slurm.py", line 193, in _requeue
    subprocess.check_call(["scontrol", "requeue", jid])
  File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['scontrol', 'requeue', '38544484_16']' returned non-zero exit status 1.
/bin/bash: /public/apps/anaconda3/2020.11/lib/libtinfo.so.6: no version information available (required by /bin/bash)
submitit ERROR (2021-03-28 01:35:36,155) - Could not dump error:
A temporary saved file already exists.

because of A temporary saved file already exists.
submitit ERROR (2021-03-28 01:35:36,156) - Submitted job triggered an exception
Traceback (most recent call last):
  File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/_submit.py", line 11, in <module>
    submitit_main()
  File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/submission.py", line 71, in submitit_main
    process_job(args.folder)
  File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/submission.py", line 64, in process_job
    raise error
  File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/submission.py", line 54, in process_job
    with utils.temporary_save_path(paths.result_pickle) as tmppath:  # save somewhere else, and move
  File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/utils.py", line 171, in temporary_save_path
    assert not tmppath.exists(), "A temporary saved file already exists."
AssertionError: A temporary saved file already exists.
srun: error: learnfair0292: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=38544509.1

My analysis of the error is as follows. The temporary save file error is thrown in process_job here. One possible reason why this could happen is if the tmppath was created previously in the try block, but there was a failure before the context ended.

This could happen either in the utils.cloudpickle_dump() call or in logger.info(). However, I can see a temporary save path 38544484_16_0_result.pkl.save_tmp that contains the following information ('success', None). So is the error with logger? Or am I completely off here?

I'm running a job array with 1024 jobs and 128 slurm_array_parallelism. The code run by the jobs actually completed and the results were saved. So I don't think this is an error in the python function I ran.

Import error

This bug is baffling me; I'm sure this is user error because I normally have no issues with your code. I have not figured out what is different from my other submissions, but maybe you've seen this before?

.../python3.8/site-packages/submitit/core/_submit.py", line 7, in <module>
    from .submission import submitit_main
ImportError: attempted relative import with no known parent package

Documentation on distributed checkpointing

I'm working with a distributed training setup and need to make it work with cluster preemption. I've read the Checkpointing docs but those do not address what happens when there are multiple tasks.

I'd be happy to write a couple of examples of checkpointing multi-task jobs for the docs. However, I'm not fully sure how the internals work.

My current understanding (based on trial and error) is:

  • Only the task with global_rank 0 gets the checkpointing signal (i.e. the call to .checkpoint) and has the opportunity to return a DelayedSubmission object.
  • Since other tasks do not get the checkpoint call they have to regularly checkpoint in order to resume in the case of a requeue. This is needed when the different replicas are allowed to diverge over time or have sparse synchronization.

Combining results without leaving a shell running

Any suggestions for combining results from jobs (i.e. job.result()) without leaving a shell running the whole time? Just loop through the folders, load each pickle, and append the output? Or is there a better way, e.g. using job dependencies?

For example, a typical way to grab all the results (while leaving a shell running) would be:

import submitit

def add(a, b):
    return a + b

log_folder = "log_test"
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]

executor = submitit.AutoExecutor(folder=log_folder)
executor.update_parameters(slurm_array_parallelism=2)
jobs = executor.map_array(add, a, b)  # should I save jobs as a .pkl? Or is there a better way to reload these?

# what if the shell stopped here? What's a simple way to perform the next few lines (or equivalent)?
out = []
for job in jobs:
    out.append(job.result())

print(out)
# [11, 22, 33, 44]
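
One approach, sketched below, is to record the job ids at submission time and rebuild the Job objects later from the log folder; submitit.SlurmJob(folder, job_id) is used the same way as in the pickle-compatibility issue further down this page. The file name is illustrative:

import submitit

# at submission time, persist the ids (a plain text file is enough)
with open("job_ids.txt", "w") as f:
    f.write("\n".join(job.job_id for job in jobs))

# later, in a fresh shell, rebuild the jobs and collect results
with open("job_ids.txt") as f:
    job_ids = f.read().split()
reloaded = [submitit.SlurmJob(log_folder, jid) for jid in job_ids]
out = [job.result() for job in reloaded]
print(out)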

Command for after srun executes

I utilize the setup parameter now to include running commands in the sbatch script before running srun, but is it possible to add commands to be executed after srun?

Computational task with 120k iterations, ~20 CPU days total: checkpointing and general suggestions

Rather than the actual work I'm doing, here is an MWE using a simple "**2" function, i.e. take the square of each of 120k numbers.

# modules and functions
import submitit
import pickle
from numpy.random import randint

def square(alist):
    """Return the square of a number for a list of numbers (alist)"""
    out = []
    for a in alist:
        out.append(a ** 2)
    return out

def chunks(lst, n):
    """Return successive n-sized chunks from lst."""
    out = []
    for i in range(0, len(lst), n):
        out.append(lst[i:i + n])
    return out

# data
a = randint(100, size=120000)

# submitit setup
log_folder = "log_test/%j"
walltime = 5
chunksize = 300
pars = chunks(a, chunksize) # split into chunks
partition = 'mypartition'
account = 'myaccount'

## execution
executor = submitit.AutoExecutor(folder=log_folder)
executor.update_parameters(timeout_min=walltime, slurm_partition=partition, slurm_additional_parameters={'account': account})
jobs = executor.map_array(square, pars)  # sbatch array

## concatenation
njobs = len(jobs)
output = []
for i in range(njobs):
    output.append(jobs[i].result())

# save output
with open('output.pkl', 'wb') as f:
    pickle.dump(output, f)

If every job runs successfully to completion (no timeout, preemption, etc.), then I should get output.pkl with all 120k results. However, if any one of the jobs fails (due to timeout, etc.), then it will throw an error at output.append(jobs[i].result()), which relates to #1627 and #1625. In the real task, some jobs might finish in 40 minutes while others take over 2 hrs, despite having the same chunk size of 300. In my specific case, the time scales linearly with the total number of atoms across all crystal structures in the job. For example, the chunk size is 300 (i.e. 300 crystal structures), and collectively these crystal structures have e.g. 3671 atoms; this will take about twice as long as a chunk that only has ~1500 atoms in total. I've thought of a few options:

  1. set an overkill walltime
  2. identify failed jobs using slurm commands and rerun these with higher walltimes after all jobs have finished, then concatenate in a separate python instance (see #1625)
  3. output a file for each iteration (i.e. 120k files) and identify the files that don't exist yet and rerun these

While one of these should work, I'd like to learn/try something a bit more robust/efficient for the long term. I've been looking through the checkpointing documentation and noticed Issue #9 seems relevant. Any guidance on implementing this or some alternative?
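
One possible direction, sketched under the assumption that the checkpointing mechanism from the docs applies here: make the submitted callable a class with a checkpoint method that resubmits the same instance, carrying the partial results, when the job is preempted or times out. The chunking and resume logic below is illustrative:

import submitit

class SquareChunk:
    """Checkpointable callable: keeps partial results on the instance."""

    def __init__(self):
        self.out = []

    def __call__(self, alist):
        # resume after the items already processed in a previous run
        for a in alist[len(self.out):]:
            self.out.append(int(a) ** 2)
        return self.out

    def checkpoint(self, alist):
        # called by submitit on preemption/timeout: requeue this same
        # instance (with its partial results) as a new submission
        return submitit.helpers.DelayedSubmission(self, alist)

# then, as in the MWE above:
# jobs = executor.map_array(SquareChunk(), pars)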

Getting a recursion error with pickling the function

Traceback (most recent call last):
  File "/uufs/chpc.utah.edu/common/home/u1326059/software/pkg/miniconda3/envs/epdo/lib/python3.8/site-packages/cloudpickle/cloudpickle_fast.py", line 563, in dump
    return Pickler.dump(self, obj)
  File "/uufs/chpc.utah.edu/common/home/u1326059/software/pkg/miniconda3/envs/epdo/lib/python3.8/site-packages/cloudpickle/cloudpickle_fast.py", line 639, in reducer_override
    if sys.version_info[:2] < (3, 7) and _is_parametrized_type_hint(obj):  # noqa  # pragma: no branch
RecursionError: maximum recursion depth exceeded in comparison

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "submit_epdo.py", line 79, in <module>
    jobs = executor.map_array(density_hkl, pars)  # sbatch array
  File "/uufs/chpc.utah.edu/common/home/u1326059/software/pkg/miniconda3/envs/epdo/lib/python3.8/site-packages/submitit/core/core.py", line 630, in map_array
    return self._internal_process_submissions(submissions)
  File "/uufs/chpc.utah.edu/common/home/u1326059/software/pkg/miniconda3/envs/epdo/lib/python3.8/site-packages/submitit/auto/auto.py", line 204, in _internal_process_submissions
    return self._executor._internal_process_submissions(delayed_submissions)
  File "/uufs/chpc.utah.edu/common/home/u1326059/software/pkg/miniconda3/envs/epdo/lib/python3.8/site-packages/submitit/slurm/slurm.py", line 322, in _internal_process_submissions
    d.dump(pickle_path)
  File "/uufs/chpc.utah.edu/common/home/u1326059/software/pkg/miniconda3/envs/epdo/lib/python3.8/site-packages/submitit/core/utils.py", line 134, in dump
    cloudpickle_dump(self, filepath)
  File "/uufs/chpc.utah.edu/common/home/u1326059/software/pkg/miniconda3/envs/epdo/lib/python3.8/site-packages/submitit/core/utils.py", line 238, in cloudpickle_dump
    cloudpickle.dump(obj, ofile, pickle.HIGHEST_PROTOCOL)
  File "/uufs/chpc.utah.edu/common/home/u1326059/software/pkg/miniconda3/envs/epdo/lib/python3.8/site-packages/cloudpickle/cloudpickle_fast.py", line 55, in dump
    CloudPickler(
  File "/uufs/chpc.utah.edu/common/home/u1326059/software/pkg/miniconda3/envs/epdo/lib/python3.8/site-packages/cloudpickle/cloudpickle_fast.py", line 570, in dump
    raise pickle.PicklingError(msg) from e
_pickle.PicklingError: Could not pickle object as excessively deep recursion required.

The function uses a with statement, the Wolfram Client Library for Python, and a custom package I wrote in Mathematica which I'm using a function from. I imagine pickling the function is what's causing the deep recursion. If I submit it as a command, does it still pickle the function?
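
If the function is wrapped as a command with submitit.helpers.CommandFunction (mentioned in the README above), only the command line and the small wrapper object are pickled, not the Wolfram session objects, so this might sidestep the recursion. A sketch with a hypothetical script name:

import submitit
from submitit.helpers import CommandFunction

# compute_density.py is a hypothetical script containing the density_hkl logic
function = CommandFunction(["python", "compute_density.py", "--params", "params.json"])

executor = submitit.AutoExecutor(folder="log_test")
executor.update_parameters(timeout_min=60, slurm_partition="dev")
job = executor.submit(function)
print(job.result())  # captured stdout of the command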

1.3.1 version installation fails from PyPi due to requirements.

Hello!

Installation of the newest version (1.3.1) fails with error

FileNotFoundError: [Errno 2] No such file or directory: 'requirements/main.txt'

I install with command

pip install submitit --upgrade

Although, this works perfectly:

pip install submitit==1.3.0 --upgrade

It could be a duplicate of this issue.

Using submitit without slurm

I'd like to use submitit on a cluster that doesn't use slurm. That is, I'd like to use submitit's functionality for pickling the python environment and submitting a function to run on another machine, but specify the nodes manually rather than using the slurm scheduler. A few questions:

  1. Is this currently supported? (I believe not)
  2. Is this a reasonable thing to do? Or are there better alternatives to submitit for this use case?
  3. Is the right way to do this to write an SshExecutor to replace the SlurmExecutor? If so, are there any pitfalls I should be concerned about if I try to implement this?

Thanks!

Access some information about the job when reloading it

Hi!

Would it be possible to have access to some information about a job when reloading a Job with its job_id?

My use case is the following: I launched a lot of jobs, and I want to plot some metrics I logged. Most of the time, I only care about the jobs I just launched, or the jobs I launched the day before. Therefore, I would need to filter my jobs according to their launching time. If I'm correct, this is not currently possible.

Other information might be interesting, for instance knowing whether a job has been preempted, since this is a common bug source.

I tag @jrapin here because I talked with him about this feature.

Strange bug upon preemption

Hi, since the cluster update I am encountering the following error upon preemption:
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=8, worker_count=16, timeout=0:30:00)

Full log:

submitit WARNING (2021-03-15 17:50:23,701) - Bypassing signal 15
WARNING:submitit:Bypassing signal 15
slurmstepd: error: *** JOB 37361369 ON learnfair5146 CANCELLED AT 2021-03-15T18:00:46 DUE TO PREEMPTION ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
submitit WARNING (2021-03-15 18:00:46,902) - Bypassing signal 15
WARNING:submitit:Bypassing signal 15
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Connection refused
submitit ERROR (2021-03-15 18:45:40,210) - Submitted job triggered an exception
ERROR:submitit:Submitted job triggered an exception
Traceback (most recent call last):
  File "/private/home/sdascoli/.conda/envs/bert/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/private/home/sdascoli/.conda/envs/bert/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/private/home/sdascoli/.conda/envs/bert/lib/python3.7/site-packages/submitit/core/_submit.py", line 11, in <module>
    submitit_main()
  File "/private/home/sdascoli/.conda/envs/bert/lib/python3.7/site-packages/submitit/core/submission.py", line 71, in submitit_main
    process_job(args.folder)
  File "/private/home/sdascoli/.conda/envs/bert/lib/python3.7/site-packages/submitit/core/submission.py", line 64, in process_job
    raise error
  File "/private/home/sdascoli/.conda/envs/bert/lib/python3.7/site-packages/submitit/core/submission.py", line 53, in process_job
    result = delayed.result()
  File "/private/home/sdascoli/.conda/envs/bert/lib/python3.7/site-packages/submitit/core/utils.py", line 126, in result
    self._result = self.function(*self.args, **self.kwargs)
  File "run_pretrained.py", line 64, in __call__
    classification.main(self.args)
  File "/checkpoint/sdascoli/cnn2transformer/1615824683/main.py", line 199, in main
    utils.init_distributed_mode(args)
  File "/checkpoint/sdascoli/cnn2transformer/1615824683/utils.py", line 276, in init_distributed_mode
    world_size=args.world_size, rank=args.rank)
  File "/private/home/sdascoli/.conda/envs/bert/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 525, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/private/home/sdascoli/.conda/envs/bert/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 215, in _store_based_barrier
    rank, store_key, world_size, worker_count, timeout))
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=8, worker_count=16, timeout=0:30:00)
srun: error: learnfair5094: tasks 0-1,3-7: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=37361369.1
slurmstepd: error: *** STEP 37361369.1 ON learnfair5094 CANCELLED AT 2021-03-15T18:45:41 ***
srun: error: learnfair5094: task 2: Exited with exit code 1

Any ideas where this can be coming from ?
Thanks :)

setting MASTER_ADDR and MASTER_PORT for distributed jobs

Normally I set these values automatically in the bash script; is there a way to add additional commands? For example:

export MASTER_ADDR=$(hostname -s) 
export MASTER_PORT=$(python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()') 
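
One workaround is to set the variables from inside the submitted function using submitit.JobEnvironment(), assuming it exposes the allocated node names as hostnames (the port choice below is arbitrary and must be free on the first node):

import os
import submitit

def train():
    env = submitit.JobEnvironment()
    # use the first node of the allocation as the rendezvous host
    os.environ["MASTER_ADDR"] = env.hostnames[0]
    os.environ["MASTER_PORT"] = "29500"  # arbitrary port, must match on all tasks
    # ... initialize torch.distributed and run the training here ...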

Add qos parameter to Slurm Executor

Slurm implements a Quality of Service (QOS), see https://slurm.schedmd.com/qos.html
that is used in our cluster to prioritize jobs and groups of users. Some partitions also require specific QOS.

It would be great if submitit could include the QOS option in the generated sbatch script.

I assume this would only require modifying the _make_sbatch_string function in slurm.py

qos: tp.Optional[str] = None,

and the documentation.
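
In the meantime, a possible workaround (assuming the slurm_additional_parameters argument used in other snippets on this page also accepts qos) would be:

executor = submitit.AutoExecutor(folder="log_test")
executor.update_parameters(
    timeout_min=60,
    slurm_partition="dev",
    slurm_additional_parameters={"qos": "high"},  # "high" is a placeholder QOS name
)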

Support for OpenPBS/PBS Pro Scheduler

It would be highly useful to add support for the OpenPBS/PBS Pro scheduler; as far as I understand, SLURM is very similar?

From a very brief look at the codebase, it already looks like all the SLURM-related code is split out into its own module. I would offer to try to contribute to this feature depending on available time and with appropriate direction, although I am not intimately familiar with either scheduler.

Downstream this would also lead to support for the hydra submitit plugin.

NotImplementedError when getting a job result for version 1.1.4

Hi,

I have been using submitit a lot recently to submit my jobs on a SLURM-based HPC. It has been very good to me, allowing me to avoid writing an extra API, so thanks for this project.

When I installed version 1.1.4, I got an error whose stacktrace is as follows:

Traceback (most recent call last):
  File "three_d.py", line 19, in <module>
    train_eval_grid(
  File "/gpfsdswork/projects/rech/hih/uap69lx/submission-scripts/jean_zay/submitit/general_submissions.py", line 63, in train_eval_grid
    run_ids = [job.result() for job in jobs]
  File "/gpfsdswork/projects/rech/hih/uap69lx/submission-scripts/jean_zay/submitit/general_submissions.py", line 63, in <listcomp>
    run_ids = [job.result() for job in jobs]
  File "/linkhome/rech/gencea10/uap69lx/.conda/envs/dis-mri-recon/lib/python3.8/site-packages/submitit/core/core.py", line 266, in result
    r = self.results()
  File "/linkhome/rech/gencea10/uap69lx/.conda/envs/dis-mri-recon/lib/python3.8/site-packages/submitit/core/core.py", line 284, in results
    self.wait()
  File "/linkhome/rech/gencea10/uap69lx/.conda/envs/dis-mri-recon/lib/python3.8/site-packages/submitit/core/core.py", line 387, in wait
    while not self.done():
  File "/linkhome/rech/gencea10/uap69lx/.conda/envs/dis-mri-recon/lib/python3.8/site-packages/submitit/core/core.py", line 424, in done
    if self.watcher.is_done(self.job_id, mode="force" if force_check else "standard"):
  File "/linkhome/rech/gencea10/uap69lx/.conda/envs/dis-mri-recon/lib/python3.8/site-packages/submitit/core/core.py", line 100, in is_done
    state = self.get_state(job_id, mode=mode)
  File "/linkhome/rech/gencea10/uap69lx/.conda/envs/dis-mri-recon/lib/python3.8/site-packages/submitit/core/core.py", line 53, in get_state
    raise NotImplementedError
NotImplementedError

I didn't get the chance to work on a minimal reproducible example yet, but I can tell you that this happens in 1.1.4 and doesn't happen in 1.1.3.
I can point you in the meantime to the failing line in my public code: here.

Include script directory to the search path?

Feedback received:

It would be great if submitit also included the script directory in its search path, so the script could run as in non-submitit cases.

submitit is expected to work as if it were run locally, so adding the script path can definitely make sense if it has no side effects (anything to fear?)
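
As a user-side workaround in the meantime, the script directory can be captured at submission time and added to sys.path inside the submitted function (the module name below is hypothetical, and the directory must be visible from the worker nodes, e.g. on a shared filesystem):

import os
import sys
import submitit

script_dir = os.path.dirname(os.path.abspath(__file__))  # captured on the submitting machine

def run():
    # make imports behave as if the script were run from its own directory
    if script_dir not in sys.path:
        sys.path.insert(0, script_dir)
    import my_local_module  # hypothetical module living next to the script
    return my_local_module.main()

executor = submitit.AutoExecutor(folder="log_test")
executor.update_parameters(timeout_min=5, slurm_partition="dev")
job = executor.submit(run)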

specify memory in MB

Recently, I wanted to submit a job that requires 2.5 GB of memory. However, the memory requirement is input through the mem_gb parameter, which accepts an integer only. Slurm can handle memory requirements in multiple units, as per sbatch docs.

--mem=<size[units]>
Specify the real memory required per node. Default units are megabytes. Different units can be specified using the suffix [K|M|G|T].

The solution might be to add mutually exclusive mem_kb, mem_mb, and mem_tb kwargs in submitit/slurm/slurm.py in addition to mem_gb, or to allow setting the memory as a string, e.g. mem='2500MB'.

Thanks!
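
Until then, a possible workaround (assuming the slurm_additional_parameters argument used elsewhere on this page, and that leaving mem_gb unset leaves --mem for you to fill in) is to pass the memory string directly:

executor = submitit.AutoExecutor(folder="log_test")
executor.update_parameters(
    timeout_min=60,
    slurm_partition="dev",
    # do not set mem_gb; pass the sbatch value directly instead
    slurm_additional_parameters={"mem": "2500M"},
)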

How to comment a slurm variable?

Hi All,

I observe the following error, which is due to the added "#SBATCH --gpus-per-node=4" line in the generated slurm script.

Error : submitit.core.utils.FailedJobError: sbatch: error: Batch job submission failed: Invalid generic resource (gres) specification

Can developers/users of submitit guide me on where to comment out/delete the above line in the slurm script before it is submitted by the sbatch command?

Thanks,
Amit Ruhela

Multi node jobs run till timeout even when done

Hi,

Thanks for the submitit utility, very useful for slurm.

When I use the utility for running multi-node training on Slurm (say with a 6 hr timeout window, the max my server allows), even if the job is small and is done within 6 hrs, I can still see it running on Slurm until it times out (using squeue). I'm guessing it has to do with the way submitit winds down the processes. (The job shows as complete in the master process's log, including the submitit INFO message 'Job completed successfully'.)

mem-per-gpu

I want to use --mem-per-gpu, which I can set using the additional parameters, but since it is incompatible with the --mem flag, it causes sbatch to crash because --mem is set automatically. Can --mem be made optional?

[Bug] Submitit crashes when executor folder contains `\=0` or `"=0"`

Submitit fails when the log directory (the folder passed to the executor) is of a particular form. Examples of logdirs that cause this error, but are valid directory names:

  • logdir = 'submitit_outputs/a"=0"/%j'
  • logdir = 'submitit_outputs/a\=0/%j'
  • logdir = 'submitit_outputs/a\=/%j'

submitit_test.py

import submitit

def add(a, b):
    return a + b

logdir = 'submitit_outputs/a"=0"/%j'
executor = submitit.AutoExecutor(folder=logdir)
executor.update_parameters(timeout_min=4, slurm_partition="priority", slurm_comment='iccv')

job = executor.submit(add, 5, 7)  # will compute add(5, 7)
print(logdir, job.job_id)  # ID of your job

output = job.result()  # waits for the submitted function to complete and returns its output
# if ever the job failed, job.result() will raise an error with the corresponding trace
assert output == 12  # 5 + 7 = 12...  your addition was computed in the cluster

Output

$ python submitit_test.py 
submitit_outputs/a"=0"/%j 35101624
Traceback (most recent call last):
  File "submitit_test.py", line 13, in <module>
    output = job.result()  # waits for the submitted function to complete and returns its output
  File "/private/home/shubhamgoel/.conda/envs/py3d-implicit/lib/python3.8/site-packages/submitit/core/core.py", line 266, in result
    r = self.results()
  File "/private/home/shubhamgoel/.conda/envs/py3d-implicit/lib/python3.8/site-packages/submitit/core/core.py", line 289, in results
    outcome, result = self._get_outcome_and_result()
  File "/private/home/shubhamgoel/.conda/envs/py3d-implicit/lib/python3.8/site-packages/submitit/core/core.py", line 368, in _get_outcome_and_result
    raise utils.UncompletedJobError(
submitit.core.utils.UncompletedJobError: Job 35101624 (task: 0) with path /private/home/shubhamgoel/code/3Dify/hydra_test/submitit_outputs/a"=0"/35101624/35101624_0_result.pkl
has not produced any output (state: FAILED)
Error stream produced:
----------------------
None

Environment
submitit 1.1.5
Python 3.8.5
Ubuntu 20.04.1 LTS

RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!

Hi All,
I am trying to run DINO on multiple nodes with submitit. We have a slurm server and I am able to train DINO on the slurm server using a single node (8gpus) [WITHOUT USING submitit] but when I try to run with multiple nodes, I am getting the below error:

submitit ERROR (2021-07-30 01:10:30,581) - Submitted job triggered an exception
Traceback (most recent call last):
File "/home/user/skanaconda3/envs/url/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/user/skanaconda3/envs/url/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/user/skanaconda3/envs/url/lib/python3.8/site-packages/submitit/core/_submit.py", line 11, in
submitit_main()
File "/home/user/skanaconda3/envs/url/lib/python3.8/site-packages/submitit/core/submission.py", line 71, in submitit_main
process_job(args.folder)
File "/home/user/skanaconda3/envs/url/lib/python3.8/site-packages/submitit/core/submission.py", line 64, in process_job
raise error
File "/home/user/skanaconda3/envs/url/lib/python3.8/site-packages/submitit/core/submission.py", line 53, in process_job
result = delayed.result()
File "/home/user/skanaconda3/envs/url/lib/python3.8/site-packages/submitit/core/utils.py", line 128, in result
self._result = self.function(*self.args, **self.kwargs)
File "run_with_submitit.py", line 67, in call
main_dino_initialize_all.train_dino(self.args)
File "/home/user/code/dino/main_dino_initialize_all.py", line 143, in train_dino
utils.init_distributed_mode(args)
File "/home/user/code/dino/utils.py", line 468, in init_distributed_mode
dist.init_process_group(
File "/home/user/skanaconda3/envs/url/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 439, in init_process_group
_default_pg = _new_process_group_helper(
File "/home/user/skanaconda3/envs/url/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 528, in _new_process_group_helper
pg = ProcessGroupNCCL(
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!

From logs, I see that the job initially gets assigned to two nodes [with 8 gpus in each node] and then stops with the above error.
Thanks in advance!

RuntimeError: No shared folder available

Traceback (most recent call last):
File "run_with_submitit.py", line 130, in
main()
File "run_with_submitit.py", line 89, in main
args.job_dir = get_shared_folder() / "%j"
File "run_with_submitit.py", line 40, in get_shared_folder
raise RuntimeError("No shared folder available")
RuntimeError: No shared folder available

Error whilst

Hi,

I was trying to run the example (add(a, b)) provided at https://github.com/facebookincubator/submitit, and my HPC cluster is throwing the following error (script1):
submitit.core.utils.FailedJobError: sbatch: error: Job rejected: Please do not specify cores/CPUs/tasks for GPU jobs.

So when I unset the ntasks_per_node variable, my HPC cluster throws the following error (script 2):
IndexError: tuple index out of range

Can you please advise on how to resolve this?
script1.txt
script2.txt

TypeError: an integer is required (got type bytes)

Since upgrading to python 3.8 I can't access my old jobs' submission pickle (error below).

The problem might be related to this issue or this one but I have no clue what it means.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-17-0248393e65dd> in <module>
     30         if state != "COMPLETED":
     31             continue
---> 32         row = job.submission().kwargs
     33         row["scores"] = job.result()
     34         row["exp_name"] = exp_name

~/dev/ext/submitit/submitit/core/core.py in submission(self)
    206             self.paths.submitted_pickle.exists()
    207         ), f"Cannot find job submission pickle: {self.paths.submitted_pickle}"
--> 208         return utils.DelayedSubmission.load(self.paths.submitted_pickle)
    209 
    210     def cancel_at_deletion(self, value: bool = True) -> "Job[R]":

~/dev/ext/submitit/submitit/core/utils.py in load(cls, filepath)
    133     @classmethod
    134     def load(cls: Type["DelayedSubmission"], filepath: Union[str, Path]) -> "DelayedSubmission":
--> 135         obj = pickle_load(filepath)
    136         # following assertion is relaxed compared to isinstance, to allow flexibility
    137         # (Eg: copying this class in a project to be able to have checkpointable jobs without adding submitit as dependency)

~/dev/ext/submitit/submitit/core/utils.py in pickle_load(filename)
    271     # this is used by cloudpickle as well
    272     with open(filename, "rb") as ifile:
--> 273         return pickle.load(ifile)
    274 
    275 

TypeError: an integer is required (got type bytes)

Repro:
Start a job with python 3.7 and then try to access it in python 3.8.
In python 3.7

Python 3.7.4 (default, Aug 13 2019, 20:35:49)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import submitit
>>>
>>> def add(a, b):
...     return a + b
...
>>> executor = submitit.AutoExecutor(folder="log_test")
>>> executor.update_parameters(timeout_min=1, slurm_partition="dev")
>>> job = executor.submit(add, 5, 7)
>>> print(job.job_id)
33389760
>>> job.submission()
<submitit.core.utils.DelayedSubmission object at 0x7f42f5952bd0>

In python 3.8

Python 3.8.5 | packaged by conda-forge | (default, Jul 24 2020, 01:25:15)
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import submitit
>>> job = submitit.SlurmJob("log_test", "33389760")
>>> job.submission()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/private/home/louismartin/dev/ext/submitit/submitit/core/core.py", line 208, in submission
    return utils.DelayedSubmission.load(self.paths.submitted_pickle)
  File "/private/home/louismartin/dev/ext/submitit/submitit/core/utils.py", line 135, in load
    obj = pickle_load(filepath)
  File "/private/home/louismartin/dev/ext/submitit/submitit/core/utils.py", line 273, in pickle_load
    return pickle.load(ifile)
TypeError: an integer is required (got type bytes)
>>> job.submission()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/private/home/louismartin/dev/ext/submitit/submitit/core/core.py", line 208, in submission
    return utils.DelayedSubmission.load(self.paths.submitted_pickle)
  File "/private/home/louismartin/dev/ext/submitit/submitit/core/utils.py", line 135, in load
    obj = pickle_load(filepath)
  File "/private/home/louismartin/dev/ext/submitit/submitit/core/utils.py", line 273, in pickle_load
    return pickle.load(ifile)
TypeError: an integer is required (got type bytes)

Temporary pickle path is not always removed

pickle_path = utils.JobPaths.get_first_id_independent_folder(self.folder) / f"{tmp_uuid}.pkl"

I am having problems with the temporary pickle path created at job submission. When looking at my submitit directory, there are tons of temporary pickles left over (probably thousands). I was looking at this because submitting jobs is very slow on my system (submitting 45 jobs takes 50 seconds, and profiling tells me that the culprits are io.open, posix.mkdir and posix.rename).

Two solutions I see to solve the leftover pickle paths:

  • My guess is that they are not removed when an error happens during submission, so a simple fix would be to add a try-except and remove them on error.
  • A cleaner solution would be to create a temp file in the system's temp directory (tempfile.mkstemp) instead of in the submitit dir. This way the system manages clearing the temporary storage. A small drawback is that if the temporary storage is not on the same device as the submitit dir, we get an additional cross-device copy operation.

SLURM sweep with hydra + submitit

I try to sweep a set of hyperparams using the slurm submitit hydra plugin.

I run:

python run.py --multirun --config-name atari-slurm seed=1,2,3,4,5

And my config file looks something like this:

defaults:
  - _self_
  - task@_global_: atari/breakout
  - override hydra/launcher: submitit_slurm

seed: 1

hydra:
  run:
    dir: ./exp_local/${now:%Y.%m.%d}/${now:%H%M%S}_${hydra.job.override_dirname}
  sweep:
    dir: ./exp/${now:%Y.%m.%d}/${now:%H%M}_${agent_cfg.experiment}
    subdir: ${hydra.job.num}
  launcher:
    timeout_min: 4300
    cpus_per_task: 4
    gpus_per_node: 4
    tasks_per_node: 4
    mem_gb: 160
    nodes: 2
    partition: gpu
#    gres: gpu:4
    cpus_per_gpu: 16
    gpus_per_task: 1
    constraint: K80
#    mem_per_gpu: null
#    mem_per_cpu: null
    submitit_folder: ./exp/${now:%Y.%m.%d}/${now:%H%M%S}_${agent_cfg.experiment}/.slurm

However, I get this error:

raise InterpolationKeyError(f"Interpolation key '{inter_key}' not found")
omegaconf.errors.InterpolationKeyError: Interpolation key 'agent_cfg.experiment' not found

*I'm running this on a slurm cluster where we ordinarily use sbatch to submit jobs such as a multi-gpu job like the one defined in the config above.

short jobs timeout immediately

There are some magic numbers in the process of determining whether a job timed out:

def has_timed_out(self) -> bool:
    # SignalHandler is created by submitit as soon as the process start,
    # so _start_time is an accurate measure of the global runtime of the job.
    walltime = time.time() - self._start_time
    max_walltime = self._delayed._timeout_min * 60
    guaranteed_walltime = min(max_walltime * 0.8, max_walltime - 10 * 60)
    timed_out = walltime >= guaranteed_walltime

This seems to cause short jobs to time out immediately:

[2021-10-21 11:50:15,215][submitit][INFO] - Job has timed out. Ran 0 minutes out of requested 2 minutes.
[2021-10-21 11:50:15,216][submitit][WARNING] - Caught signal SIGUSR1 on rmc-gpu05: this job is timed-out.
[2021-10-21 11:50:15,216][submitit][INFO] - Calling checkpoint method.
[2021-10-21 11:50:15,256][submitit][INFO] - Job not requeued because: timed-out too many times.
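
Working through those numbers: with timeout_min=2, max_walltime = 120 seconds, so guaranteed_walltime = min(120 * 0.8, 120 - 600) = min(96, -480) = -480 seconds. Since the elapsed walltime is never negative, walltime >= guaranteed_walltime holds from the very first signal, which matches the "Ran 0 minutes out of requested 2 minutes" log above. Any requested timeout below 10 minutes makes the max_walltime - 10 * 60 term negative and therefore triggers the timeout immediately.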

ValueError: LocalExecutor can use only one node. Use nodes=1

Traceback (most recent call last):
File "run_with_submitit.py", line 131, in
main()
File "run_with_submitit.py", line 116, in main
**kwargs
File "/opt/tiger/conda/lib/python3.7/site-packages/submitit/core/core.py", line 638, in update_parameters
self._internal_update_parameters(**kwargs)
File "/opt/tiger/conda/lib/python3.7/site-packages/submitit/auto/auto.py", line 197, in _internal_update_parameters
self._executor._internal_update_parameters(**parameters)
File "/opt/tiger/conda/lib/python3.7/site-packages/submitit/local/local.py", line 158, in _internal_update_parameters
raise ValueError("LocalExecutor can use only one node. Use nodes=1")
ValueError: LocalExecutor can use only one node. Use nodes=1

Specify partition & hardware

Hi, thank you for this library.

I skimmed through the docs, but I do not see any option for specifying a partition.

I see that hydra's command line plugin for Submitit supports this, so I'm probably missing something here.

Is there detailed documentation for submitit?
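
For reference, the partition (and similar hardware options) can be passed directly to update_parameters, as in the README example at the top of this page; the values below are illustrative:

executor = submitit.AutoExecutor(folder="log_test")
executor.update_parameters(
    timeout_min=60,
    slurm_partition="dev",   # partition name
    slurm_gpus_per_node=2,   # hardware requests use the slurm_ prefix with AutoExecutor
    slurm_cpus_per_task=8,
    slurm_mem_gb=16,
)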

[BUG] `Scontrol` Error when checkpointing / preemption on slurm

Hi,

For me, submitit works great when there is no need for checkpointing / preemption, but I have the following error when I need to checkpoint:
FileNotFoundError: [Errno 2] No such file or directory: 'scontrol'

Specifically, I can reproduce this error by running docs/mnist.py, I ran the following three version of the mnist example to understand the issue:

  • Running docs/mnist.py on slurm as is, I get the previous error. Full logs: stderr , stdout
  • If I ssh into some slurm node that I get allocated to and run docs/mnist.py on the local executor (cluster="local"), everything works as it should: so submitit + checkpointing works fine.
  • Running docs/mnist.py but without preemption (removing timeout_min and job._interrupt()), everything works fine: so slurm + submitit work fine.

Also, scontrol seems to work fine on my login node, so I don't understand why the check_call(["scontrol", "requeue", jid]) does not work. That being said, scontrol does not work on the nodes I get allocated to (it only works from the login nodes), but from my understanding check_call(["scontrol", "requeue", jid]) is called from where I call submitit, so not having scontrol on the allocated nodes shouldn't be an issue, am I correct?

Thank you !

Context manager and log folder naming

Using the batch context manager, I'm having trouble naming output folders according to the arguments of my for loops. Is there a way to do the following:

with executor.batch():
    for a in [1, 2]:
        for b in [3, 4, 5]:
            job = executor.submit(func, a, b)

where the output folders are formatted as: f"/path/to/outputs/%j/a{a}-b{b}/" instead of .../<array job id>_<array task id>?

I'm also open to any other suggested solution (e.g. submitting a job array without using the context manager).

Logs of slurm job failure in submitit logs

I am facing an issue with jobs failing without a clear reason in the error log. I am using a combination of hydra and submitit, and I am noticing some jobs end prematurely with no errors reported in either <slurm_id>_log.err or <slurm_id>_log.txt. The only indication that the job ended is a message like submitit INFO (2021-07-29 00:56:05,771) - Job completed successfully.

Are there ways the jobs can be killed that aren't reported by slurm/submitit? Or are there other logs that one can inspect, or options to provide more context as to why the job failed? Also, my apologies if this is not a submitit issue.

Thank you!

[Custom python environment]

Hi all !

In the context of cluster computing, it is sometimes necessary to have the jobs run with a local Python environment and not the central one (where the job has been launched).

Currently, the python executable path is automatically extracted with shlex.quote(sys.executable) in slurm.py before being used to generate the sbatch file in _make_sbatch_string.

While this is a great default behaviour, we would argue with @MJHutchinson to be able to specify a custom python executable path to override the default one, as implemented in this commit. The key line is:

 f"srun --output {stdout} --error {stderr} --unbuffered {executable} {command}",

Then, in combination with hydra, it is sufficient to add executable:/data/localhost/not-backed-up/${env:USER}/utils/venv_projec_name/bin/python in the slurm config.

If that sounds like a good solution to you all, I'll push a PR; otherwise I'm happy to discuss the issue.

Additionally, it seems that it is now possible to execute commands before running srun cf hydra_submitit_launcher. Would be great to also have a teardown argument to execute commands after running srun.

Best,
Emile
