malramsay64 / experi
An interface for managing computational experiments with many independent variables.
License: MIT License
The initialisation of the `SchedulerOptions` class sets some default options which turn out to be incompatible with the SLURM scheduler. For example, `N` means the number of nodes in Slurm, but experi currently assumes it always means the job name.
This is an example of a more general problem: each job scheduler has its own set of valid and invalid options, so defaulting to the option another scheduler (or the shell) uses doesn't make sense. If `SLURMOptions` doesn't recognise a command then it should either throw an error immediately or include it without question as an arbitrary key. What it should not do is default to the option used by another scheduler, because that is probably wrong too, and the end result is an error which is harder to debug.
This makes me wonder whether inheritance (of `SLURMOptions` from `SchedulerOptions`) is actually the right paradigm to be using here. It would be better to have some structure which allows new users to easily plug in the options for their particular scheduler, without having to worry about whether they have correctly overridden all the default behaviour. Or just do away with defaults completely and make people specify every batch option they want in the `.yml` file via the arbitrary-keys method (I actually think that's a good idea for reproducibility purposes too).
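As a rough illustration of the "error immediately or pass through untouched" behaviour, here is a minimal sketch. The function name and the set of known options are purely illustrative, not experi's actual API:

```python
# Hypothetical sketch: each scheduler validates its own options explicitly
# instead of inheriting defaults from another scheduler.
KNOWN_SLURM_OPTIONS = {"job-name", "nodes", "tasks-per-node", "time"}

def build_slurm_options(options: dict, strict: bool = True) -> dict:
    """Either raise immediately on unknown keys, or pass them through verbatim."""
    unknown = set(options) - KNOWN_SLURM_OPTIONS
    if strict and unknown:
        raise ValueError("Unknown SLURM options: {}".format(sorted(unknown)))
    # Non-strict mode: include unrecognised keys untouched as arbitrary options.
    return dict(options)
```

Either behaviour fails (or succeeds) loudly at submission-script generation time, rather than silently substituting another scheduler's meaning.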
Also for reference this is the set of commands used by the machine I'm trying to run my jobs on, and here is a helpful "Rosetta Stone for job schedulers" to compare against.
The types returned by the latest ruamel.yaml are breaking assertions within the codebase.
When the first job fails, all subsequent jobs should be removed from the scheduler. I believe this
is working on Quartz, however not on Artemis. I don't know if this is an issue with the scheduler or
in my code.
Include an option for the command to have a prefix which is applied to all subsequent commands. The main reason for this is to handle MPI specifications, i.e. putting `mpirun -np 8` in front of all the commands.
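A sketch of what this could look like in the input file; the `prefix` key shown here is hypothetical and not (yet) part of experi:

```yaml
# Hypothetical syntax -- the "prefix" key is an illustration only.
jobs:
  - prefix: mpirun -np 8
    command:
      - ./simulation_a
      - ./simulation_b
# would expand to:
#   mpirun -np 8 ./simulation_a
#   mpirun -np 8 ./simulation_b
```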
This should be something that explains how to set everything up with specific emphasis on the variables. At this point it only needs to be a markdown file on github.
Neither of the HPC systems that I currently have access to have python 3.6
installed. This is a problem for the simple installation, requiring the install
of python before even being able to test this out.
Slurm is an alternative scheduler to PBSPro which has a large userbase. Compatibility with this scheduler would be really nice.
There is a docker container available which could be used for testing and development purposes.
Emit a more sensible warning when a variable is not defined; this should come somewhere at the start of the code generation.
Currently Python 3.6 and 3.7 are the only versions supported. It is possible to support older 3.x versions with a little effort, but is that effort worthwhile?
Somewhat of a blocker for Python <3.6 is #14 as this makes testing non-deterministic.
Which python version do you use and would like supported?
There are a number of instances I have come across where an experiment has gone
wrong for whatever reason and so I want to delete all the created files. Having
a quick command/option to delete everything except the experiment.yml file
would be nice.
Different commands can have different requirements for the number of cpus, so
allowing for a different number for each command gives much more flexibility.
I would like to make running commands the same regardless of where they are run, whether that be in the current shell, submitting jobs to PBSPro, or submitting jobs to SLURM. This makes it simple to test on one platform before running on another.
A requirement for this is to handle bash operators in some way. I am currently running the commands using the python subprocess module; if I use that to create a bash shell which then runs the command, that might work.
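The "create a bash shell" idea can be sketched with subprocess directly: passing `shell=True` together with `executable="/bin/bash"` hands the whole string to bash, so operators like `&&`, `|` and `>` need no parsing on the Python side. The helper name here is illustrative:

```python
import subprocess

def run_in_bash(command: str) -> str:
    """Run a command string through bash so shell operators work."""
    result = subprocess.run(
        command,
        shell=True,               # interpret the string with a shell
        executable="/bin/bash",   # force bash rather than /bin/sh
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # return str rather than bytes
        check=True,               # raise CalledProcessError on failure
    )
    return result.stdout

print(run_in_bash("echo first && echo second"))
```

The trade-off is that the command string is then subject to full shell interpretation, which matters if variable values can contain shell metacharacters.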
I am working in a python3.6 conda environment, and I try to install experi by navigating to the root directory of experi and running `python3 setup.py develop` (which is normally the best way on conda I think). This causes experi to show up when I enter `conda list`, but calling `experi` on the command line returns the error
/cineca/prod/opt/compilers/python/3.6.4/none/bin/python3.6: error while loading shared libraries: libpython3.6m.so.1.0: cannot open shared object file: No such file or directory
This is weird because that's not the version of python that should be being used: `which python` returns the python3.6 version in my conda environment,
/marconi_work/FUA32_SOL_BOUT/tnichola/anaconda3/envs/py36/bin/python
and `python3` takes me into the conda python3 as it should.
Is it possible that experi is looking for python in the wrong place? Or is my environment just messed up?
When using the dependencies and all the commands have been run, the resulting file has an empty command array which will try to run a non-existent job.
Having spaces in the name is incompatible with the PBS system, so automatically strip them out.
The walltime in the generated pbs file is just an integer, which is not particularly human readable. I am not sure why this is, and not sure if it actually affects anything, but it should be fixed.
Having a sequence of variables defined in an input file is a standard method of running simulations. It should be simple to create an input file and substitute in the input variables.
I thought that using redirection like `>` would work; it doesn't, and I would need some parsing to pass it separately to the subprocess.run command. A more appropriate approach might be a file key.
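The substitution part is straightforward with `str.format`; a minimal sketch, assuming the variables come in as a dict (the template contents and helper name here are illustrative, not experi's implementation):

```python
# An input-file template with {placeholders} matching experi's variable names.
TEMPLATE = """\
viscosity = {viscosity}
steps = {steps}
"""

def render_input(template: str, variables: dict) -> str:
    """Substitute experiment variables into an input-file template."""
    return template.format(**variables)

print(render_input(TEMPLATE, {"viscosity": 3.7, "steps": 1000}))
```

A `file` key could then point at a template file on disk and write the rendered result into each run directory.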
I really like this, but I don't want to have to use the command-line interface. This is partly because I'm having difficulties getting it to work with my conda environment (which I might raise a separate issue about), but also because being able to call experi from python scripts would fit into my workflow better.
Trying to directly import and run experi's `main()` function doesn't work because unfortunately the click library doesn't allow you to just bypass its decorators (see the discussion here), so any option to do this would have to involve some (minor) changes to experi (PR incoming).
The pbs scripts are set up to only work on Quartz, this is somewhat restricting however, so I need a more generalised method of submitting pbs jobs.
While the options I have are relatively tame defaults, there are many others that might be required
for any number of reasons. Rather than try and build in support for everything, the approach might
be to allow any arbitrary options.
I have come across a number of bugs (ff7d78a, 9d0a8ce, b5b665f) that have resulted from not properly testing the job submission section of the codebase.
A flexible way to test this interface is to mock it, allowing any result I want to be returned. The specific technique I am interested in is monkeypatching.
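A minimal sketch of the idea using the stdlib's `unittest.mock`: the submission call is patched so no real scheduler is ever touched, and the test controls exactly what "the scheduler" returns. `submit_job` here is a stand-in, not experi's actual function:

```python
import subprocess
from unittest import mock

def submit_job(script: str) -> str:
    """Submit a script with qsub and return the job id (stand-in function)."""
    result = subprocess.run(
        ["qsub", script], stdout=subprocess.PIPE, universal_newlines=True
    )
    return result.stdout.strip()

def test_submit_job():
    # Replace subprocess.run with a mock returning a canned scheduler response.
    fake = mock.Mock(return_value=subprocess.CompletedProcess(
        args=["qsub", "job.pbs"], returncode=0, stdout="1234.pbs01\n"))
    with mock.patch("subprocess.run", fake):
        assert submit_job("job.pbs") == "1234.pbs01"
    fake.assert_called_once()

test_submit_job()
```

pytest's `monkeypatch` fixture achieves the same thing with slightly less ceremony, and either lets the tests exercise error paths (non-zero return codes, malformed job ids) that are hard to trigger on a real scheduler.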
With the current implementation I am unable to have two zip operations at the same level. Taking the below code as an example
```yaml
variables:
  zip:
    var1: [1, 2, 3]
    var2: [2, 3, 4]
  zip:
    var3: [a, b]
    var4: [x, y]
```
With the two `zip` keys this is an impossible data structure.
A possible solution is to allow for a list like below.
```yaml
variables:
  zip:
    - var1: [1, 2, 3]
      var2: [2, 3, 4]
    - var3: [a, b]
      var4: [x, y]
```
Alternatively the previous approach I was using didn't have this issue, although that does mean breaking things again.
I do think that the current implementation does make things a lot clearer for most cases.
When a pbs job is deleted or fails, only the jobs which have it as a direct
dependency will also stop; this doesn't propagate further down the chain,
leaving orphaned jobs. A potential solution to this problem is to add all
preceding commands to the check of completions.
There are currently very few tests to ensure the slurm support is working. I would like to add some more to be confident that it has the right behaviour.
Currently the array jobs don't take into account that there are jobs missing when running with the --use-dependencies flag. This causes a segmentation fault and failure when the job runs.
At the very least this should create an empty job script to run.
Increment the version and release to PyPI for simple installation.
Python 3.6 introduced a new dictionary implementation which retains order of insertion. This means
that the ordering of the dictionary and consequently the ordering of variable iteration is
deterministic.
In versions prior to this, the ordering of variable iteration can vary from run to run.
This manifests itself as variation in the ordering of commands, e.g. the difference between
```
echo 1 1
echo 1 2
echo 1 3
echo 2 1
echo 2 2
echo 2 3
```
and
```
echo 1 1
echo 2 1
echo 1 2
echo 2 2
echo 1 3
echo 2 3
```
This primarily poses an issue with testing, although it would be nice to remove random elements.
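The dependence on dict ordering can be shown with a short sketch (the variable names are illustrative). The generated command order follows the iteration order of the variables dict, which is only guaranteed on Python 3.6+:

```python
from itertools import product

# Command generation order follows the dict's iteration order.
variables = {"x": [1, 2], "y": [1, 2, 3]}
commands = [
    "echo {} {}".format(*values) for values in product(*variables.values())
]
# With insertion-ordered dicts (Python 3.6+) this is always:
# ['echo 1 1', 'echo 1 2', 'echo 1 3', 'echo 2 1', 'echo 2 2', 'echo 2 3']
print(commands)
```

On older Pythons the order of `variables.values()` can differ between interpreter runs, which is exactly the variation shown above.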
Give an indication of how these files can be used.
Create a README with instructions for installation and general use.
Include logic for dealing with failure: commands that run no matter what, only if there is a failure, or only on complete success. This can be implemented with the logic of the scheduler, plus custom logic for the bash side.
This means being able to specify things like the output and error files, when to email warnings and other configuration options available to pbs.
A docker container in which the testing for the scheduler can be run. This also means I don't need a dedicated system with slurm for development.
I specify `walltime: 8:00:00` in my `.yml` file, but the `.time["walltime"]` attribute of the instance of `SLURMOptions` gets set to 28800 (the number of seconds in 8 hours). This is causing the SLURM batch scheduler to throw an error.
I can't for the life of me find which bit of experi's code is doing this conversion from `hours:minutes:seconds` to seconds, so for now I have had to just hack it.
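Whichever direction the fix takes, converting the integer back into a form sbatch's `--time` flag accepts is a few lines; a sketch (the helper name is illustrative):

```python
def seconds_to_walltime(seconds: int) -> str:
    """Convert an integer number of seconds to the H:MM:SS form sbatch accepts."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return "{}:{:02d}:{:02d}".format(hours, minutes, secs)

print(seconds_to_walltime(28800))  # 8 hours -> "8:00:00"
```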
I have managed to get jobs to submit on SLURM, but they failed once they started with the very cryptic error that "mkdir has no option 's'".
My `.yml` file looks like
```yaml
jobs:
  - command:
      - mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_{viscosity}_vortloss_{vorticity_loss}
      - cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_{viscosity}_vortloss_{vorticity_loss}
    variables:
      product:
        viscosity:
          - 1e1
          - 3.7e1
          - 1e0
          - 3.7e0
        vorticity_loss:
          - 1e0
          - 3.7e0
          - 1e1
          - 3.7e1
    slurm:
      job-name: core_phi_visc_scan
      nodes: 1
      tasks-per-node: 48
      walltime: 8:00:00
      p: skl_fua_prod
      A: FUA32_SOL_BOUT
      mail-user: [email protected]
      mail-type: END,FAIL
      error: /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/slurm-%A_%a.err
      output: /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue
    setup:
      - export OMP_NUM_THREADS=1
```
which with my altered version of the code (supplied in this PR) produces the batch script file `experi_00.slurm` which contains:
```bash
#!/bin/bash
#SBATCH --job-name core_phi_visc_scan
#SBATCH --time 8:00:00
#SBATCH --output /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/slurm-%A_%a.out
#SBATCH --nodes 1
#SBATCH --tasks-per-node 48
#SBATCH -p skl_fua_prod
#SBATCH -A FUA32_SOL_BOUT
#SBATCH --mail-user [email protected]
#SBATCH --mail-type END,FAIL
#SBATCH --error /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/slurm-%A_%a.err
#SBATCH --array 0-15
cd "$SLURM_SUBMIT_DIR"
export OMP_NUM_THREADS=1
COMMAND=( \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e1_vortloss_1e0 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e1_vortloss_1e0" \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e1_vortloss_3.7e0 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e1_vortloss_3.7e0" \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e1_vortloss_1e1 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e1_vortloss_1e1" \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e1_vortloss_3.7e1 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e1_vortloss_3.7e1" \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e1_vortloss_1e0 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e1_vortloss_1e0" \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e1_vortloss_3.7e0 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e1_vortloss_3.7e0" \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e1_vortloss_1e1 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e1_vortloss_1e1" \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e1_vortloss_3.7e1 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e1_vortloss_3.7e1" \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e0_vortloss_1e0 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e0_vortloss_1e0" \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e0_vortloss_3.7e0 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e0_vortloss_3.7e0" \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e0_vortloss_1e1 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e0_vortloss_1e1" \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e0_vortloss_3.7e1 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_1e0_vortloss_3.7e1" \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e0_vortloss_1e0 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e0_vortloss_1e0" \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e0_vortloss_3.7e0 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e0_vortloss_3.7e0" \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e0_vortloss_1e1 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e0_vortloss_1e1" \
"mkdir /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e0_vortloss_3.7e1 && cd /marconi_work/FUA32_SOL_BOUT/tnichola/runs/core_phi_issue/visc_3.7e0_vortloss_3.7e1" \
)
${COMMAND[$SLURM_ARRAY_TASK_ID]}
```
I can't see anything wrong with this, but it's still failing. I don't know a lot about bash scripting (I always try to use python as much as possible instead!), but I think it has something to do with creating a bash array whose elements are multiple chained commands, i.e. at the command prompt entering
```bash
COMMAND=( "echo 0 && echo 1" "echo 2 && echo 3" )
${COMMAND[1]}
```
This example is in the format experi produces, but prints `2 && echo 3`, which is clearly not the desired output.
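This is a known bash behaviour: unquoted expansion only word-splits the string, it never re-parses operators like `&&`. One possible fix (not necessarily what experi should adopt) is to pass the string back through `eval`, which does re-parse it:

```shell
COMMAND=( "echo 0 && echo 1" "echo 2 && echo 3" )

# Unquoted expansion: && is passed to echo as a literal argument.
${COMMAND[1]}          # prints: 2 && echo 3

# eval re-parses the string, so the chain actually executes.
eval "${COMMAND[1]}"   # prints: 2, then 3
```

`bash -c "${COMMAND[1]}"` behaves similarly; both come with the usual caveat that the command string is fully shell-interpreted.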
Allow for system specific configurations for simulation runs.
Create a flag which doesn't execute any code on the host, instead telling the user what code would have been executed.
Allow for the creation of a file using a simple interface, something like
```yaml
file:
  - name:
  - contents:
  - template:
```
which would allow for the specification of the file content either inline or alternatively using a template file.
Currently experi uses the job array functionality to submit multiple jobs, but I want to question whether this is the best thing to do.
Creating a single batch script specifying a job array means that you don't have access to the {variables} in any of the scheduler options, so you can't vary (for example) processor number (and hence potentially resolution), output or err files.
If instead experi created a template for a single non-array job, then copied specific instances of this template into individual run directories, then it could be much more flexible. This would also be better for reproducibility, because the exact options used would be stored in the directory containing the output. It would help when one job needs to be re-run or restarted (which can easily happen due to numerical instabilities), because the job file for that simulation would be stand-alone.
Wrapping all the commands in a bash array also makes debugging harder, as in #65.
Finally it seems like some job schedulers don't have an option to submit arrays (see LoadLeveler here), which would mean they can never be supported by experi.
I suppose that some of the advantages of using the job array system are the ability to control the jobs together using commands like `scancel` (or whatever the PBS equivalents are), but I'm not sure this outweighs the disadvantages of using the very limited array system which PBS and SLURM currently have implemented.
Have a proper method of handling the output from the scheduler. Currently I have to manually create a log directory, which I tend not to do, instead putting all the logs into a single file.
The default should be to log to a logs directory, with both the output and error going to the same file. Sending output and error to the same file matches the behaviour of the shell, so is not unexpected.
Line 542 in 5d1c1a7
This line causes an error when `dry_run=True` because `cmd_res` has not been defined. To fix this, the line should be deleted.
Failure should occur as fast as possible, so the best way to ensure this is to check as much as possible as early as possible. In particular, catch user errors such as an invalid set of commands, or missing variables required for the commands to run.
The current workflow is that each job is completely separate, i.e. all of A finishes before any of B starts. However an alternate and completely valid workflow is to want to run A then B for all variables without the requirement of all A finishing first.
Basically have some way of having multiple commands in a single pbs script.
I would like to use bumpversion as the method of incrementing the version number.
The trailing newlines are causing trouble with the bash arrays.
Currently the command is just a series of strings, however there is a lot of additional functionality I would like to add on to the command, and I think the most appropriate method of doing that is as a class.
It would be really useful to be able to define a variable as a function of another, mostly simple functions like addition/multiplication. The use case I would like to use it for is the definition of the keyframe distance as a function of the total number of steps.
Ensure that dependent jobs are being specified correctly, particularly when there are multiple dependent jobs.
Support applying a zip for the same number of steps to multiple series at once, e.g. applying the same number of steps to two sequences of temperatures for different values of a pressure.
For each command I would like the ability to have files that are required for the command to run, as well as files that the command creates. At this stage this would mainly be as a method of not rerunning commands where possible.