malramsay64 / experi

An interface for managing computational experiments with many independent variables.

License: MIT License

Language: Python 100.00%

Topics: hpc, pbs, pbs-script, pbspro, reproducibility, science, science-research, simulation

experi's People

Contributors: dependabot-preview[bot], malramsay64, tomnicholas

Stargazers: 4

Watchers: 2

Forkers: tomnicholas

experi's Issues

Clashes between default options of different schedulers

The initialisation of the SchedulerOptions class sets some default options which turn out to be incompatible with the SLURM scheduler. For example, N means the number of nodes in Slurm, but experi currently assumes it always means the job name.

This is an example of a more general problem: each job scheduler simply has valid and invalid options, so defaulting to the option another scheduler (or the shell) uses doesn't make sense. If SLURMOptions doesn't recognise an option then it should either throw an error immediately or include it without question as an arbitrary key. What it should not do is fall back to the option used by another scheduler, because that is probably wrong too, and the end result is an error which is harder to debug.

This makes me wonder whether inheritance (of SLURMOptions from SchedulerOptions) is actually the right paradigm to be using here. It would be better to have some structure which allows new users to easily plug in the options for their particular scheduler, without having to worry about whether they have correctly overwritten all the default behaviour. Alternatively, do away with defaults completely and make people specify every batch option they want in the .yml file via the arbitrary keys method. (I actually think that is a good idea for reproducibility purposes too.)
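As a sketch of what a plug-in structure might look like, each scheduler could register an explicit translation table from canonical option names to its own flags, with unrecognised options passed through verbatim. The names below are hypothetical, not experi's actual API:

# Hypothetical sketch: explicit per-scheduler translation tables
# rather than inheritance with cross-scheduler defaults.
SCHEDULER_FLAGS = {
    "slurm": {"name": "--job-name={}", "walltime": "--time={}"},
    "pbs": {"name": "-N {}", "walltime": "-l walltime={}"},
}

def render_option(scheduler: str, key: str, value: str) -> str:
    """Translate a canonical option, passing unrecognised keys through verbatim."""
    template = SCHEDULER_FLAGS[scheduler].get(key)
    if template is None:
        # Arbitrary key: emit it untouched so any error comes from the scheduler itself.
        return f"--{key}={value}"
    return template.format(value)

print(render_option("slurm", "walltime", "8:00:00"))  # --time=8:00:00
print(render_option("pbs", "name", "my_job"))         # -N my_job

Adding a new scheduler would then be a single dictionary entry rather than a subclass whose inherited defaults must all be audited.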

Also for reference, this is the set of commands used by the machine I'm trying to run my jobs on, and here is a helpful "Rosetta Stone for job schedulers" to compare against.

Dependent jobs not deleting on failure

When the first job fails, all subsequent jobs should be removed from the scheduler. I believe this is working on Quartz, but not on Artemis. I don't know if this is an issue with the scheduler or with my code.

Prefix option for all commands

Include an option for the command to have a prefix which is applied to all subsequent commands. The main reason for this is to handle MPI specifications, i.e. putting mpirun -np 8 in front of every command.
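A hypothetical sketch of how this could look in the input file; the prefix key does not currently exist and the command names are placeholders:

jobs:
  - prefix: mpirun -np 8   # hypothetical key: prepended to every command below
    command:
      - my_simulation --temperature {temperature}
      - my_analysis --temperature {temperature}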

Support for older python3 versions

Neither of the HPC systems that I currently have access to has python 3.6 installed. This is a problem for simple installation, requiring an install of python before even being able to test this out.

Support for Slurm

Slurm is an alternative scheduler to PBSPro which has a large userbase. Compatibility with this scheduler would be really nice.

There is a docker container available which could be used for testing and development purposes.

RFC Python versions to support

Currently Python 3.6 and 3.7 are the only versions supported. It is possible to support older 3.x versions with a little effort, but is that effort useful?

Somewhat of a blocker for Python <3.6 is #14, as this makes testing non-deterministic.

Which python version do you use and would like supported?

Quick method for removing all other files in directory

There are a number of instances I have come across where an experiment has gone
wrong for whatever reason and so I want to delete all the created files. Having
a quick command/option to delete everything except the experiment.yml file
would be nice.
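A minimal sketch of what such a clean-up could do, assuming everything lives in a single directory alongside the experiment.yml file:

from pathlib import Path
import shutil

def clean_experiment(directory: str = ".", keep: str = "experiment.yml") -> None:
    """Delete everything in the experiment directory except the input file."""
    for path in Path(directory).iterdir():
        if path.name == keep:
            continue
        if path.is_dir():
            shutil.rmtree(path)
        else:
            path.unlink()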

Allow specification of ncpus in command

Different commands can have different requirements for the number of cpus, so
allowing for a different number for each command gives much more flexibility.

Support bash operators for jobs in current shell

I would like to make running commands the same regardless of where they are run, whether that be in the current shell, submitting jobs to PBSPro, or submitting jobs to SLURM, making it simple to test on one platform before running on another.

A requirement for this is to handle bash operators in some way. I am currently running the commands using the python subprocess module; if I instead use it to spawn a bash shell which runs the command, that might work.
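A minimal sketch of that approach with the standard library, running each command through bash so operators like && and > are interpreted:

import subprocess

def run_in_bash(command: str) -> subprocess.CompletedProcess:
    """Run a command through bash so shell operators (&&, |, >) work."""
    return subprocess.run(command, shell=True, executable="/bin/bash", check=True)

run_in_bash("echo hello && echo world > output.txt")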

Running experi from the command line in a conda environment

I am working in a python3.6 conda environment, and I try to install experi by navigating to the root directory of experi and running python3 setup.py develop (which is normally the best way on conda, I think). This causes experi to show up when I enter conda list, but calling $ experi on the command line returns the error

/cineca/prod/opt/compilers/python/3.6.4/none/bin/python3.6: error while loading shared libraries: libpython3.6m.so.1.0: cannot open shared object file: No such file or directory

This is weird because that is not the version of python that should be used: $ which python returns the python3.6 version in my conda environment:

/marconi_work/FUA32_SOL_BOUT/tnichola/anaconda3/envs/py36/bin/python

and $ python3 takes me into the conda python3 as it should.

Is it possible that experi is looking for python in the wrong place? Or is my environment just messed up?

Add ability to create input files

Having a sequence of variables defined in an input file is a standard method of running simulations. It should be simple to create an input file and substitute in the input variables.

I thought that using redirection like > would work; it doesn't, and I would need some parsing to pass it separately to the subprocess.run command. A more appropriate approach might be a file key.
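For reference, a minimal sketch of doing the redirection on the python side rather than parsing > out of the command string:

import subprocess

def run_with_output(command: list, output_file: str) -> None:
    """Run a command, sending its stdout to a file (python-side redirection)."""
    with open(output_file, "w") as stdout:
        subprocess.run(command, stdout=stdout, check=True)

run_with_output(["echo", "temperature: 300"], "input.inp")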

Calling experi from python instead of using command-line interface

I really like this, but I don't want to have to use the command-line interface. This is partly because I'm having difficulties getting it to work with my conda environment (which I might raise a separate issue about), but also because being able to call experi from python scripts would fit into my workflow better.

Trying to directly import and run experi's main() function doesn't work because unfortunately the click library doesn't allow you to just bypass its decorators (see the discussion here), so any option to do this would have to involve some (minor) changes to experi (PR incoming).
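For reference, click does provide a supported way to invoke a decorated command from python through its testing utilities, which may be enough for scripting. The import path below is an assumption for illustration:

from click.testing import CliRunner

from experi.run import main  # assumed import path for the click entry point

runner = CliRunner()
result = runner.invoke(main, [])  # pass any CLI arguments as a list of strings
print(result.exit_code, result.output)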

Get working with pbs on Artemis

The pbs scripts are set up to only work on Quartz, which is somewhat restricting, so I need a more generalised method of submitting pbs jobs.

Allow arbitrary PBS options

While the options I have are relatively tame defaults, there are many others that might be required for any number of reasons. Rather than try to build in support for everything, the approach might be to allow arbitrary options.
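A hypothetical sketch of what arbitrary pass-through options might look like in the yaml; the keys here are examples, not an implemented interface:

pbs:
    walltime: 8:00:00        # recognised option
    M: user@example.com      # arbitrary: passed through as '-M user@example.com'
    l: select=1:ncpus=16     # arbitrary: passed through as '-l select=1:ncpus=16'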

Mock interface with system

I have come across a number of bugs (ff7d78a, 9d0a8ce, b5b665f) that have resulted from not properly testing the job submission section of the codebase.

A flexible way to test this interface is to mock it, allowing any result I want to be returned. The specific approach I am interested in is monkeypatching.
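A minimal sketch of that approach with pytest's monkeypatch fixture, faking the subprocess call that would submit a job so no real scheduler is needed:

import subprocess

def test_submission_failure(monkeypatch):
    """Pretend qsub failed without ever touching a real scheduler."""
    def fake_run(*args, **kwargs):
        return subprocess.CompletedProcess(args=args, returncode=1, stdout=b"")

    monkeypatch.setattr(subprocess, "run", fake_run)
    result = subprocess.run(["qsub", "job.pbs"])
    assert result.returncode == 1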

Allow multiple zip keys

With the current implementation I am unable to have two zip operations at the same level. Take the code below as an example:

variables:
  zip:
    var1: [1, 2, 3]
    var2: [2, 3, 4]
  zip:
    var3: [a, b]
    var4: [x, y]

With the two zip keys this is an impossible data structure, since a yaml mapping cannot contain duplicate keys.

A possible solution is to allow a list, like below:

variables:
  zip:
    - var1: [1, 2, 3]
      var2: [2, 3, 4]

    - var3: [a, b]
      var4: [x, y]

Alternatively, the previous approach I was using didn't have this issue, although going back to it does mean breaking things again.

I do think that the current implementation does make things a lot clearer for most cases.
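For reference, a minimal sketch of the semantics the list form would imply: zip within each group, then take the product across groups. This is plain standard library, not experi's implementation:

from itertools import product

groups = [
    {"var1": [1, 2, 3], "var2": [2, 3, 4]},
    {"var3": ["a", "b"], "var4": ["x", "y"]},
]

# zip within each group: {'var1': 1, 'var2': 2}, {'var1': 2, 'var2': 3}, ...
zipped = [
    [dict(zip(group, values)) for values in zip(*group.values())]
    for group in groups
]

# product across groups gives every combination of the zipped entries
for combination in product(*zipped):
    merged = {key: value for entry in combination for key, value in entry.items()}
    print(merged)
# 3 x 2 = 6 combinations in total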

Jobs leave orphans after failure

When a pbs job is deleted or fails, only the jobs which have it as a direct dependency will also stop; this doesn't propagate further down the chain, leaving orphaned jobs. A potential solution to this problem is to add all preceding commands to the check of completions.

Further testing of slurm support

There are currently very few tests to ensure the slurm support is working. I would like to add some more to be confident that it has the right behaviour.

Handle dependencies in setting array jobs

Currently the array jobs don't take into account that there are jobs missing when running with the --use-dependencies flag. This causes a segmentation fault, and hence failure, when the job runs.

At the very least this should create an empty job script to run.

Deterministic ordering in python < 3.6

Python 3.6 introduced a new dictionary implementation which retains order of insertion. This means
that the ordering of the dictionary and consequently the ordering of variable iteration is
deterministic.

In versions prior to this, the ordering of variable iteration can vary from run to run. This manifests itself as variation in the ordering of commands, i.e. the difference between

echo 1 1
echo 1 2
echo 1 3
echo 2 1
echo 2 2
echo 2 3

and

echo 1 1
echo 2 1
echo 1 2
echo 2 2
echo 1 3
echo 2 3

This primarily poses an issue with testing, although it would be nice to remove random elements.
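A minimal sketch of one way to restore determinism on older pythons, using collections.OrderedDict, which preserves insertion order on every supported version:

from collections import OrderedDict

# OrderedDict preserves insertion order on python < 3.6 as well, so variable
# iteration (and hence the ordering of generated commands) is deterministic.
variables = OrderedDict([("var1", [1, 2]), ("var2", [1, 2, 3])])

for name, values in variables.items():
    for value in values:
        print("echo", name, value)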

Create README

Create a README with instructions for installation and general use.

Allow commands to be run on failure

Include logic for dealing with failure: commands that run no matter what, only if there is a failure, or only on complete success. This can be implemented with the scheduler's dependency logic, plus custom logic for the bash side.
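For reference, both schedulers already expose this behaviour through job dependencies, which could be the basis of an implementation (the job ID 1234 is a placeholder):

# SLURM: run cleanup only if job 1234 fails, notify only after success
sbatch --dependency=afternotok:1234 cleanup.slurm
sbatch --dependency=afterok:1234 notify.slurm

# PBS equivalents
qsub -W depend=afternotok:1234 cleanup.pbs
qsub -W depend=afterok:1234 notify.pbs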

Include support for more pbs options

This means being able to specify things like the output and error files, when to email warnings, and other configuration options available to pbs.

Time being converted to seconds incompatible with SLURM

I specify walltime: 8:00:00 in my .yml file, but the .time["walltime"] attribute of the SLURMOptions instance gets set to 28800 (the number of seconds in 8 hours). This causes the SLURM batch scheduler to throw an error.

I can't for the life of me find which bit of experi's code is doing this conversion from hours:minutes:seconds to seconds, so for now I have had to just hack it.
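For reference, converting back is a one-liner with the standard library, which could be applied wherever the seconds value is written out (a sketch, not a pointer to the offending code):

import datetime

seconds = 28800
walltime = str(datetime.timedelta(seconds=seconds))
print(walltime)  # 8:00:00 -- the hours:minutes:seconds format SLURM expects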

Problems running job with multiple commands (on SLURM?)

I have managed to get jobs to submit on SLURM, but they failed once they started with the very cryptic error that "mkdir has no option 's'".

My .yml file looks like

jobs:
  - command:
    - mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_{viscosity}_vortloss_{vorticity_loss}
    - cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_{viscosity}_vortloss_{vorticity_loss}

variables:
    product:
        viscosity:
            - 1e1
            - 3.7e1
            - 1e0
            - 3.7e0
        vorticity_loss:
            - 1e0
            - 3.7e0
            - 1e1
            - 3.7e1

slurm:
    job-name: core_phi_visc_scan
    nodes: 1
    tasks-per-node: 48
    walltime: 8:00:00
    p: skl_fua_prod
    A: FUA32_SOL_BOUT
    mail-user: [email protected]
    mail-type: END,FAIL
    error: /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/slurm-%A_%a.err
    output: /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue
    setup:
        - export OMP_NUM_THREADS=1

which with my altered version of the code (supplied in this PR) produces the batch script file experi_00.slurm which contains:

#!/bin/bash
#SBATCH --job-name core_phi_visc_scan
#SBATCH --time 8:00:00
#SBATCH --output /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/slurm-%A_%a.out
#SBATCH --nodes 1
#SBATCH --tasks-per-node 48
#SBATCH -p skl_fua_prod
#SBATCH -A FUA32_SOL_BOUT
#SBATCH --mail-user [email protected]
#SBATCH --mail-type END,FAIL
#SBATCH --error /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/slurm-%A_%a.err
#SBATCH --array 0-15

cd "$SLURM_SUBMIT_DIR"
export OMP_NUM_THREADS=1

COMMAND=( \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e1_vortloss_1e0 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e1_vortloss_1e0" \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e1_vortloss_3.7e0 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e1_vortloss_3.7e0" \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e1_vortloss_1e1 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e1_vortloss_1e1" \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e1_vortloss_3.7e1 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e1_vortloss_3.7e1" \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e1_vortloss_1e0 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e1_vortloss_1e0" \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e1_vortloss_3.7e0 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e1_vortloss_3.7e0" \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e1_vortloss_1e1 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e1_vortloss_1e1" \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e1_vortloss_3.7e1 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e1_vortloss_3.7e1" \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e0_vortloss_1e0 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e0_vortloss_1e0" \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e0_vortloss_3.7e0 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e0_vortloss_3.7e0" \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e0_vortloss_1e1 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e0_vortloss_1e1" \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e0_vortloss_3.7e1 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e0_vortloss_3.7e1" \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e0_vortloss_1e0 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e0_vortloss_1e0" \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e0_vortloss_3.7e0 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e0_vortloss_3.7e0" \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e0_vortloss_1e1 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e0_vortloss_1e1" \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e0_vortloss_3.7e1 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e0_vortloss_3.7e1" \
)

${COMMAND[$SLURM_ARRAY_TASK_ID]}

I can't see anything wrong with this, but it's still failing. I don't know a lot about bash scripting (I always try to use python as much as possible instead!), but I think it has something to do with creating a bash array whose elements are multiple chained commands, i.e. at the command prompt entering

COMMAND=( "echo 0 && echo 1" "echo 2 && echo 3" )

${COMMAND[1]}

This example is in the format experi produces, but ${COMMAND[1]} prints 2 && echo 3, which is clearly not the desired output.
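The cause is that an unquoted ${COMMAND[1]} only undergoes word splitting, so && is passed to the first echo as a literal argument rather than being interpreted as an operator. One possible fix is to re-parse the string through the shell with eval:

COMMAND=( "echo 0 && echo 1" "echo 2 && echo 3" )

# eval re-parses the expanded string, so && is treated as the shell operator again
eval "${COMMAND[1]}"
# prints:
# 2
# 3

The same change would apply to the generated batch script, i.e. eval "${COMMAND[$SLURM_ARRAY_TASK_ID]}".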

Create dry run flag

Create a flag which doesn't execute any code on the host, instead telling the user what code would have been executed.

Support file creation in command

Allow for the creation of a file using a simple interface, something like

file:
  - name:
  - contents:
  - template:

which would allow for the specification of the file content either inline or alternatively using
a template file.

Job arrays or job scripts?

Currently experi uses the job array functionality to submit multiple jobs, but I want to question whether this is the best thing to do.

Creating a single batch script specifying a job array means that you don't have access to the {variables} in any of the scheduler options, so you can't vary (for example) processor number (and hence potentially resolution), or the output and error files.

If instead experi created a template for a single non-array job, then copied specific instances of this template into individual run directories, then it could be much more flexible. This would also be better for reproducibility, because the exact options used would be stored in the directory containing the output. It would help when one job needs to be re-run or restarted (which can easily happen due to numerical instabilities), because the job file for that simulation would be stand-alone.

Wrapping all the commands in a bash array also makes debugging harder, as in #65.

Finally it seems like some job schedulers don't have an option to submit arrays (see LoadLeveler here), which would mean they can never be supported by experi.

I suppose that some of the advantages of using the job array system are the ability to control the jobs together using commands like scancel (or whatever the PBS equivalents are), but I'm not sure this outweighs the disadvantages of using the very limited array system which PBS and SLURM currently have implemented.

Handle logging in pbs scripts

Have a proper method of handling the output from the scheduler. Currently I have to manually create a log directory, which I tend not to do, instead putting all the logs into a single file.

The default should be to log to a logs directory, with both the output and error going to the same file. Sending the output and error to the same file matches the behaviour of the shell, so is not unexpected.

Check valid commands at start of run

Failure should occur as fast as possible, so the best method to ensure this is to check as many of the points of failure as early as possible. In particular, catch user errors such as an invalid set of commands, or missing variables required for the commands to run.
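A minimal sketch of such an up-front check using the standard library, extracting the substitution fields from each command and comparing against the declared variables (not experi's actual implementation):

from string import Formatter

def check_commands(commands, variables):
    """Fail fast: every {field} in every command must have a matching variable."""
    declared = set(variables)
    for command in commands:
        required = {
            field for _, field, _, _ in Formatter().parse(command)
            if field is not None
        }
        missing = required - declared
        if missing:
            raise ValueError(f"Command {command!r} is missing variables: {missing}")

check_commands(["echo {temperature}"], {"temperature": [300, 400]})
check_commands(["echo {pressure}"], {"temperature": [300]})  # raises ValueError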

Choose whether to run commands as separate or same job

The current workflow is that each job is completely separate, i.e. all of A finishes before any of B starts. However, an alternative and completely valid workflow is to run A then B for all variables, without the requirement of all A finishing first.

Basically, have some way of putting multiple commands in a single pbs script.

Create command class to generate the command to run

Currently the command is just a series of strings; however, there is a lot of additional functionality I would like to attach to the command, and I think the most appropriate way of doing that is as a class.
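A minimal sketch of what such a class might hold, collecting features discussed in other issues; all attribute names are hypothetical (dataclasses need python 3.7+, or the backport on 3.6):

from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Command:
    """A single command plus the metadata other issues ask for (hypothetical)."""
    template: str                     # e.g. "mpirun -np {ncpus} simulate {temperature}"
    variables: Dict[str, object] = field(default_factory=dict)
    prefix: Optional[str] = None      # see the prefix-option issue
    requires: List[str] = field(default_factory=list)   # input files needed
    creates: List[str] = field(default_factory=list)    # output files produced

    def render(self) -> str:
        command = self.template.format(**self.variables)
        return f"{self.prefix} {command}" if self.prefix else command

print(Command("echo {value}", {"value": 3}, prefix="mpirun -np 8").render())
# mpirun -np 8 echo 3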

Allow for definition of variables as a formula

It would be really useful to be able to define a variable as a function of another, mostly simple functions like addition/multiplication. The use case I would like to use it for is defining the keyframe distance as a function of the total number of steps.
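A minimal sketch of evaluating such formulas without a bare eval, using the ast module and supporting only basic arithmetic (ast.Constant is python 3.8+; older versions use ast.Num):

import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def evaluate(node, variables):
    """Recursively evaluate a restricted arithmetic expression tree."""
    if isinstance(node, ast.Expression):
        return evaluate(node.body, variables)
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        return OPS[type(node.op)](evaluate(node.left, variables),
                                  evaluate(node.right, variables))
    if isinstance(node, ast.Constant):   # plain numbers
        return node.value
    if isinstance(node, ast.Name):       # references to other variables
        return variables[node.id]
    raise ValueError("Unsupported expression")

tree = ast.parse("total_steps / 100", mode="eval")
print(evaluate(tree, {"total_steps": 1_000_000}))  # 10000.0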

Handle multiple zip iterations

Support applying a zip with the same number of steps to multiple series at once, e.g. applying the same sequence of steps to two sequences of temperatures, one for each value of pressure.

Create command dependencies and status of completion

For each command I would like the ability to have files that are required for the command to run, as well as files that the command creates. At this stage this would mainly be used as a method of avoiding rerunning commands where possible.
