malramsay64 / experi

An interface for managing computational experiments with many independent variables.

License: MIT License

Language: Python 100.00%

Topics: hpc, pbs, pbs-script, pbspro, reproducibility, science, science-research, simulation

experi's People

Contributors: dependabot-preview[bot], malramsay64, tomnicholas

Stargazers: 4

Watchers: 2

Forkers: tomnicholas

experi's Issues

Clashes between default options of different schedulers

The initialisation of the SchedulerOptions class sets some default options which turn out to be incompatible with the SLURM scheduler. For example, N means the number of nodes in Slurm, but experi currently assumes it always means the job name.

This is an example of a more general problem: each job scheduler simply has valid and invalid options, so defaulting to the option another scheduler (or the shell) uses doesn't make sense. If SLURMOptions doesn't recognise an option then it should either throw an error immediately or include it without question as an arbitrary key. What it should not do is fall back to the option used by another scheduler, because that is probably wrong too, and the end result is an error which is harder to debug.

This makes me wonder whether inheritance (of SLURMOptions from SchedulerOptions) is actually the right paradigm to be using here. It would be better to have some structure which allows new users to easily plug in the options for their particular scheduler, without having to worry about whether they have correctly overwritten all the default behaviour. Alternatively, do away with defaults completely and make people specify every batch option they want in the .yml file via the arbitrary keys method. (I actually think that is a good idea for reproducibility purposes too.)
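As a sketch of what a plug-in structure might look like, each scheduler could register an explicit translation table from canonical option names to its own flags, with unrecognised options passed through verbatim. The names below are hypothetical, not experi's actual API:

# Hypothetical sketch: explicit per-scheduler translation tables
# rather than inheritance with cross-scheduler defaults.
SCHEDULER_FLAGS = {
    "slurm": {"name": "--job-name={}", "walltime": "--time={}"},
    "pbs": {"name": "-N {}", "walltime": "-l walltime={}"},
}

def render_option(scheduler: str, key: str, value: str) -> str:
    """Translate a canonical option, passing unrecognised keys through verbatim."""
    template = SCHEDULER_FLAGS[scheduler].get(key)
    if template is None:
        # Arbitrary key: emit it untouched so any error comes from the scheduler itself.
        return f"--{key}={value}"
    return template.format(value)

print(render_option("slurm", "walltime", "8:00:00"))  # --time=8:00:00
print(render_option("pbs", "name", "my_job"))         # -N my_job

Adding a new scheduler would then be a single dictionary entry rather than a subclass whose inherited defaults must all be audited.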

Also for reference, this is the set of commands used by the machine I'm trying to run my jobs on, and here is a helpful "Rosetta Stone for job schedulers" to compare against.

Dependent jobs not deleting on failure

When the first job fails, all subsequent jobs should be removed from the scheduler. I believe this is working on Quartz, but not on Artemis. I don't know if this is an issue with the scheduler or with my code.

Prefix option for all commands

Include an option for the command to have a prefix which is applied to all subsequent commands. The main reason for this is to handle MPI specifications, i.e. putting mpirun -np 8 in front of every command.
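A hypothetical sketch of how this could look in the input file; the prefix key does not currently exist and the command names are placeholders:

jobs:
  - prefix: mpirun -np 8   # hypothetical key: prepended to every command below
    command:
      - my_simulation --temperature {temperature}
      - my_analysis --temperature {temperature}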

Support for older python3 versions

Neither of the HPC systems that I currently have access to has python 3.6 installed. This is a problem for simple installation, requiring an install of python before even being able to test this out.

Support for Slurm

Slurm is an alternative scheduler to PBSPro which has a large userbase. Compatibility with this scheduler would be really nice.

There is a docker container available which could be used for testing and development purposes.

RFC Python versions to support

Currently Python 3.6 and 3.7 are the only versions supported. It is possible to support older 3.x versions with a little effort, but is that effort useful?

Somewhat of a blocker for Python <3.6 is #14, as this makes testing non-deterministic.

Which python version do you use and would like supported?

Quick method for removing all other files in directory

There are a number of instances I have come across where an experiment has gone
wrong for whatever reason and so I want to delete all the created files. Having
a quick command/option to delete everything except the experiment.yml file
would be nice.
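A minimal sketch of what such a clean-up could do, assuming everything lives in a single directory alongside the experiment.yml file:

from pathlib import Path
import shutil

def clean_experiment(directory: str = ".", keep: str = "experiment.yml") -> None:
    """Delete everything in the experiment directory except the input file."""
    for path in Path(directory).iterdir():
        if path.name == keep:
            continue
        if path.is_dir():
            shutil.rmtree(path)
        else:
            path.unlink()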

Allow specification of ncpus in command

Different commands can have different requirements for the number of cpus, so
allowing for a different number for each command gives much more flexibility.

Support bash operators for jobs in current shell

I would like to make running commands the same regardless of where they are run, whether that be in the current shell, submitting jobs to PBSPro, or submitting jobs to SLURM, making it simple to test on one platform before running on another.

A requirement for this is to handle bash operators in some way. I am currently running the commands using the python subprocess module; if I instead use it to spawn a bash shell which runs the command, that might work.
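A minimal sketch of that approach with the standard library, running each command through bash so operators like && and > are interpreted:

import subprocess

def run_in_bash(command: str) -> subprocess.CompletedProcess:
    """Run a command through bash so shell operators (&&, |, >) work."""
    return subprocess.run(command, shell=True, executable="/bin/bash", check=True)

run_in_bash("echo hello && echo world > output.txt")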

Running experi from the command line in a conda environment

I am working in a python3.6 conda environment, and I try to install experi by navigating to the root directory of experi and running python3 setup.py develop (which is normally the best way on conda, I think). This causes experi to show up when I enter conda list, but calling $ experi on the command line returns the error

/cineca/prod/opt/compilers/python/3.6.4/none/bin/python3.6: error while loading shared libraries: libpython3.6m.so.1.0: cannot open shared object file: No such file or directory

This is weird because that is not the version of python that should be used: $ which python returns the python3.6 version in my conda environment:

/marconi_work/FUA32_SOL_BOUT/tnichola/anaconda3/envs/py36/bin/python

and $ python3 takes me into the conda python3 as it should.

Is it possible that experi is looking for python in the wrong place? Or is my environment just messed up?

Add ability to create input files

Having a sequence of variables defined in an input file is a standard method of running simulations. It should be simple to create an input file and substitute in the input variables.

I thought that using redirection like > would work; it doesn't, and I would need some parsing to pass it separately to the subprocess.run command. A more appropriate approach might be a file key.
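For reference, a minimal sketch of doing the redirection on the python side rather than parsing > out of the command string:

import subprocess

def run_with_output(command: list, output_file: str) -> None:
    """Run a command, sending its stdout to a file (python-side redirection)."""
    with open(output_file, "w") as stdout:
        subprocess.run(command, stdout=stdout, check=True)

run_with_output(["echo", "temperature: 300"], "input.inp")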

Calling experi from python instead of using command-line interface

I really like this, but I don't want to have to use the command-line interface. This is partly because I'm having difficulties getting it to work with my conda environment (which I might raise a separate issue about), but also because being able to call experi from python scripts would fit into my workflow better.

Trying to directly import and run experi's main() function doesn't work because unfortunately the click library doesn't allow you to just bypass its decorators (see the discussion here), so any option to do this would have to involve some (minor) changes to experi (PR incoming).
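For reference, click does provide a supported way to invoke a decorated command from python through its testing utilities, which may be enough for scripting. The import path below is an assumption for illustration:

from click.testing import CliRunner

from experi.run import main  # assumed import path for the click entry point

runner = CliRunner()
result = runner.invoke(main, [])  # pass any CLI arguments as a list of strings
print(result.exit_code, result.output)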

Get working with pbs on Artemis

The pbs scripts are set up to only work on Quartz, which is somewhat restricting, so I need a more generalised method of submitting pbs jobs.

Allow arbitrary PBS options

While the options I have are relatively tame defaults, there are many others that might be required for any number of reasons. Rather than try to build in support for everything, the approach might be to allow arbitrary options.
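A hypothetical sketch of what arbitrary pass-through options might look like in the yaml; the keys here are examples, not an implemented interface:

pbs:
    walltime: 8:00:00        # recognised option
    M: user@example.com      # arbitrary: passed through as '-M user@example.com'
    l: select=1:ncpus=16     # arbitrary: passed through as '-l select=1:ncpus=16'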

Mock interface with system

I have come across a number of bugs (ff7d78a, 9d0a8ce, b5b665f) that have resulted from not properly testing the job submission section of the codebase.

A flexible way to test this interface is to mock it, allowing any result I want to be returned. The specific approach I am interested in is monkeypatching.
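A minimal sketch of that approach with pytest's monkeypatch fixture, faking the subprocess call that would submit a job so no real scheduler is needed:

import subprocess

def test_submission_failure(monkeypatch):
    """Pretend qsub failed without ever touching a real scheduler."""
    def fake_run(*args, **kwargs):
        return subprocess.CompletedProcess(args=args, returncode=1, stdout=b"")

    monkeypatch.setattr(subprocess, "run", fake_run)
    result = subprocess.run(["qsub", "job.pbs"])
    assert result.returncode == 1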

Allow multiple zip keys

With the current implementation I am unable to have two zip operations at the same level. Take the code below as an example:

variables:
  zip:
    var1: [1, 2, 3]
    var2: [2, 3, 4]
  zip:
    var3: [a, b]
    var4: [x, y]

With the two zip keys this is an impossible data structure, since a yaml mapping cannot contain duplicate keys.

A possible solution is to allow a list, like below:

variables:
  zip:
    - var1: [1, 2, 3]
      var2: [2, 3, 4]

    - var3: [a, b]
      var4: [x, y]

Alternatively, the previous approach I was using didn't have this issue, although going back to it does mean breaking things again.

I do think that the current implementation does make things a lot clearer for most cases.
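For reference, a minimal sketch of the semantics the list form would imply: zip within each group, then take the product across groups. This is plain standard library, not experi's implementation:

from itertools import product

groups = [
    {"var1": [1, 2, 3], "var2": [2, 3, 4]},
    {"var3": ["a", "b"], "var4": ["x", "y"]},
]

# zip within each group: {'var1': 1, 'var2': 2}, {'var1': 2, 'var2': 3}, ...
zipped = [
    [dict(zip(group, values)) for values in zip(*group.values())]
    for group in groups
]

# product across groups gives every combination of the zipped entries
for combination in product(*zipped):
    merged = {key: value for entry in combination for key, value in entry.items()}
    print(merged)
# 3 x 2 = 6 combinations in total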

Jobs leave orphans after failure

When a pbs job is deleted or fails, only the jobs which have it as a direct dependency will also stop; this doesn't propagate further down the chain, leaving orphaned jobs. A potential solution to this problem is to add all preceding commands to the check of completions.

Further testing of slurm support

There are currently very few tests to ensure the slurm support is working. I would like to add some more to be confident that it has the right behaviour.

Handle dependencies in setting array jobs

Currently the array jobs don't take into account that there are jobs missing when running with the --use-dependencies flag. This causes a segmentation fault, and hence failure, when the job runs.

At the very least this should create an empty job script to run.

Deterministic ordering in python < 3.6

Python 3.6 introduced a new dictionary implementation which retains order of insertion. This means
that the ordering of the dictionary and consequently the ordering of variable iteration is
deterministic.

In versions prior to this, the ordering of variable iteration can vary from run to run. This manifests itself as variation in the ordering of commands, i.e. the difference between

echo 1 1
echo 1 2
echo 1 3
echo 2 1
echo 2 2
echo 2 3

and

echo 1 1
echo 2 1
echo 1 2
echo 2 2
echo 1 3
echo 2 3

This primarily poses an issue with testing, although it would be nice to remove random elements.
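A minimal sketch of one way to restore determinism on older pythons, using collections.OrderedDict, which preserves insertion order on every supported version:

from collections import OrderedDict

# OrderedDict preserves insertion order on python < 3.6 as well, so variable
# iteration (and hence the ordering of generated commands) is deterministic.
variables = OrderedDict([("var1", [1, 2]), ("var2", [1, 2, 3])])

for name, values in variables.items():
    for value in values:
        print("echo", name, value)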

Create README

Create a README with instructions for installation and general use.

Allow commands to be run on failure

Include logic for dealing with failure: commands that run no matter what, only if there is a failure, or only on complete success. This can be implemented with the scheduler's dependency logic, plus custom logic for the bash side.
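For reference, both schedulers already expose this behaviour through job dependencies, which could be the basis of an implementation (the job ID 1234 is a placeholder):

# SLURM: run cleanup only if job 1234 fails, notify only after success
sbatch --dependency=afternotok:1234 cleanup.slurm
sbatch --dependency=afterok:1234 notify.slurm

# PBS equivalents
qsub -W depend=afternotok:1234 cleanup.pbs
qsub -W depend=afterok:1234 notify.pbs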

Include support for more pbs options

This means being able to specify things like the output and error files, when to email warnings, and other configuration options available to pbs.

Time being converted to seconds incompatible with SLURM

I specify walltime: 8:00:00 in my .yml file, but the .time["walltime"] attribute of the SLURMOptions instance gets set to 28800 (the number of seconds in 8 hours). This causes the SLURM batch scheduler to throw an error.

I can't for the life of me find which bit of experi's code is doing this conversion from hours:minutes:seconds to seconds, so for now I have had to just hack it.
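For reference, converting back is a one-liner with the standard library, which could be applied wherever the seconds value is written out (a sketch, not a pointer to the offending code):

import datetime

seconds = 28800
walltime = str(datetime.timedelta(seconds=seconds))
print(walltime)  # 8:00:00 -- the hours:minutes:seconds format SLURM expects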

Problems running job with multiple commands (on SLURM?)

I have managed to get jobs to submit on SLURM, but they failed once they started with the very cryptic error that "mkdir has no option 's'".

My .yml file looks like

jobs:
  - command:
    - mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_{viscosity}_vortloss_{vorticity_loss}
    - cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_{viscosity}_vortloss_{vorticity_loss}

variables:
    product:
        viscosity:
            - 1e1
            - 3.7e1
            - 1e0
            - 3.7e0
        vorticity_loss:
            - 1e0
            - 3.7e0
            - 1e1
            - 3.7e1

slurm:
    job-name: core_phi_visc_scan
    nodes: 1
    tasks-per-node: 48
    walltime: 8:00:00
    p: skl_fua_prod
    A: FUA32_SOL_BOUT
    mail-user: [email protected]
    mail-type: END,FAIL
    error: /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/slurm-%A_%a.err
    output: /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue
    setup:
        - export OMP_NUM_THREADS=1

which with my altered version of the code (supplied in this PR) produces the batch script file experi_00.slurm which contains:

#!/bin/bash
#SBATCH --job-name core_phi_visc_scan
#SBATCH --time 8:00:00
#SBATCH --output /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/slurm-%A_%a.out
#SBATCH --nodes 1
#SBATCH --tasks-per-node 48
#SBATCH -p skl_fua_prod
#SBATCH -A FUA32_SOL_BOUT
#SBATCH --mail-user [email protected]
#SBATCH --mail-type END,FAIL
#SBATCH --error /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/slurm-%A_%a.err
#SBATCH --array 0-15

cd "$SLURM_SUBMIT_DIR"
export OMP_NUM_THREADS=1

COMMAND=( \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e1_vortloss_1e0 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e1_vortloss_1e0" \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e1_vortloss_3.7e0 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e1_vortloss_3.7e0" \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e1_vortloss_1e1 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e1_vortloss_1e1" \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e1_vortloss_3.7e1 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e1_vortloss_3.7e1" \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e1_vortloss_1e0 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e1_vortloss_1e0" \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e1_vortloss_3.7e0 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e1_vortloss_3.7e0" \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e1_vortloss_1e1 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e1_vortloss_1e1" \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e1_vortloss_3.7e1 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e1_vortloss_3.7e1" \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e0_vortloss_1e0 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e0_vortloss_1e0" \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e0_vortloss_3.7e0 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e0_vortloss_3.7e0" \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e0_vortloss_1e1 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e0_vortloss_1e1" \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e0_vortloss_3.7e1 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e0_vortloss_3.7e1" \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e0_vortloss_1e0 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e0_vortloss_1e0" \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e0_vortloss_3.7e0 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e0_vortloss_3.7e0" \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e0_vortloss_1e1 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e0_vortloss_1e1" \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e0_vortloss_3.7e1 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e0_vortloss_3.7e1" \
)

${COMMAND[$SLURM_ARRAY_TASK_ID]}

I can't see anything wrong with this, but it's still failing. I don't know a lot about bash scripting (I always try to use python as much as possible instead!), but I think it has something to do with creating a bash array whose elements are multiple chained commands, i.e. at the command prompt entering

COMMAND=( "echo 0 && echo 1" "echo 2 && echo 3" )

${COMMAND[1]}

This example is in the format experi produces, but ${COMMAND[1]} prints 2 && echo 3, which is clearly not the desired output.
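The cause is that an unquoted ${COMMAND[1]} only undergoes word splitting, so && is passed to the first echo as a literal argument rather than being interpreted as an operator. One possible fix is to re-parse the string through the shell with eval:

COMMAND=( "echo 0 && echo 1" "echo 2 && echo 3" )

# eval re-parses the expanded string, so && is treated as the shell operator again
eval "${COMMAND[1]}"
# prints:
# 2
# 3

The same change would apply to the generated batch script, i.e. eval "${COMMAND[$SLURM_ARRAY_TASK_ID]}".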

Create dry run flag

Create a flag which doesn't execute any code on the host, instead telling the user what code would have been executed.

Support file creation in command

Allow for the creation of a file using a simple interface, something like

file:
  - name:
  - contents:
  - template:

which would allow for the specification of the file content either inline or alternatively using
a template file.

Job arrays or job scripts?

Currently experi uses the job array functionality to submit multiple jobs, but I want to question whether this is the best thing to do.

Creating a single batch script specifying a job array means that you don't have access to the {variables} in any of the scheduler options, so you can't vary (for example) processor number (and hence potentially resolution), or the output and error files.

If instead experi created a template for a single non-array job, then copied specific instances of this template into individual run directories, then it could be much more flexible. This would also be better for reproducibility, because the exact options used would be stored in the directory containing the output. It would help when one job needs to be re-run or restarted (which can easily happen due to numerical instabilities), because the job file for that simulation would be stand-alone.

Wrapping all the commands in a bash array also makes debugging harder, as in #65.

Finally it seems like some job schedulers don't have an option to submit arrays (see LoadLeveler here), which would mean they can never be supported by experi.

I suppose that some of the advantages of using the job array system are the ability to control the jobs together using commands like scancel (or whatever the PBS equivalents are), but I'm not sure this outweighs the disadvantages of using the very limited array system which PBS and SLURM currently have implemented.

Handle logging in pbs scripts

Have a proper method of handling the output from the scheduler. Currently I have to manually create a log directory, which I tend not to do, instead putting all the logs into a single file.

The default should be to log to a logs directory, with both the output and error going to the same file. Sending the output and error to the same file matches the behaviour of the shell, so is not unexpected.

Check valid commands at start of run

Failure should occur as fast as possible, so the best method to ensure this is to check as many of the points of failure as early as possible. In particular, catch user errors such as an invalid set of commands, or missing variables required for the commands to run.
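A minimal sketch of such an up-front check using the standard library, extracting the substitution fields from each command and comparing against the declared variables (not experi's actual implementation):

from string import Formatter

def check_commands(commands, variables):
    """Fail fast: every {field} in every command must have a matching variable."""
    declared = set(variables)
    for command in commands:
        required = {
            field for _, field, _, _ in Formatter().parse(command)
            if field is not None
        }
        missing = required - declared
        if missing:
            raise ValueError(f"Command {command!r} is missing variables: {missing}")

check_commands(["echo {temperature}"], {"temperature": [300, 400]})
check_commands(["echo {pressure}"], {"temperature": [300]})  # raises ValueError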

Choose whether to run commands as separate or same job

The current workflow is that each job is completely separate, i.e. all of A finishes before any of B starts. However, an alternative and completely valid workflow is to run A then B for all variables, without the requirement of all A finishing first.

Basically, have some way of putting multiple commands in a single pbs script.

Create command class to generate the command to run

Currently the command is just a series of strings; however, there is a lot of additional functionality I would like to attach to the command, and I think the most appropriate way of doing that is as a class.
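A minimal sketch of what such a class might hold, collecting features discussed in other issues; all attribute names are hypothetical (dataclasses need python 3.7+, or the backport on 3.6):

from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Command:
    """A single command plus the metadata other issues ask for (hypothetical)."""
    template: str                     # e.g. "mpirun -np {ncpus} simulate {temperature}"
    variables: Dict[str, object] = field(default_factory=dict)
    prefix: Optional[str] = None      # see the prefix-option issue
    requires: List[str] = field(default_factory=list)   # input files needed
    creates: List[str] = field(default_factory=list)    # output files produced

    def render(self) -> str:
        command = self.template.format(**self.variables)
        return f"{self.prefix} {command}" if self.prefix else command

print(Command("echo {value}", {"value": 3}, prefix="mpirun -np 8").render())
# mpirun -np 8 echo 3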

Allow for definition of variables as a formula

It would be really useful to be able to define a variable as a function of another, mostly simple functions like addition/multiplication. The use case I would like to use it for is defining the keyframe distance as a function of the total number of steps.
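A minimal sketch of evaluating such formulas without a bare eval, using the ast module and supporting only basic arithmetic (ast.Constant is python 3.8+; older versions use ast.Num):

import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def evaluate(node, variables):
    """Recursively evaluate a restricted arithmetic expression tree."""
    if isinstance(node, ast.Expression):
        return evaluate(node.body, variables)
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        return OPS[type(node.op)](evaluate(node.left, variables),
                                  evaluate(node.right, variables))
    if isinstance(node, ast.Constant):   # plain numbers
        return node.value
    if isinstance(node, ast.Name):       # references to other variables
        return variables[node.id]
    raise ValueError("Unsupported expression")

tree = ast.parse("total_steps / 100", mode="eval")
print(evaluate(tree, {"total_steps": 1_000_000}))  # 10000.0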

Handle multiple zip iterations

Support applying a zip with the same number of steps to multiple series at once, e.g. applying the same sequence of steps to two sequences of temperatures, one for each value of pressure.

Create command dependencies and status of completion

For each command I would like the ability to have files that are required for the command to run, as well as files that the command creates. At this stage this would mainly be used as a method of avoiding rerunning commands where possible.
