Comments (4)
In the process we had before, the setup of the runs was done sequentially on the login node, so when several runs were started, it could take quite a long time until the last one actually began. By integrating the run setup into the individual SLURM jobs we intended to make starting runs faster for the user and to parallelize those parts of the script that can be parallelized. However, you are right: for the short duration of the individual run setup we now block more CPUs than required. To find out how severe this is, I looked into my latest coupled runs (n = 177). Here are the mean durations per section:
  type  section  mean duration
1 rem   GAMS     67.272571 mins
2 rem   output    7.906585 mins
3 rem   prep      2.562874 mins
It seems that the mean preparation time (2.56 min) is short compared to the mean GAMS runtime (67.27 min). However, across all runs we blocked 12 CPUs 177 times for 2.56 minutes each, which adds up. By splitting the run-start process into two parts again (one SLURM job preparing all the runs, followed by the actual parallel runs), we could save cluster resources, but we would lose time because the runs would be prepared sequentially.
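A quick back-of-envelope calculation makes the resource cost concrete (the numbers are taken from the measurements above; 12 CPUs is the per-run allocation mentioned):

```shell
# Rough CPU time blocked by run preparation across all runs
runs=177      # number of coupled runs measured above
cpus=12       # CPUs allocated per run
prep_min=2.56 # mean preparation time in minutes
awk -v r="$runs" -v c="$cpus" -v p="$prep_min" \
    'BEGIN { printf "%.1f CPU-hours blocked\n", r * c * p / 60 }'
# prints "90.6 CPU-hours blocked"
```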
I don't get what you mean by "sequential procedure". Can you specify? The login nodes on the cluster have around 100 CPUs available. I'm sure they can host our model-preparation jobs, so there is no need to send them to the compute nodes. But if we do have to send them to the compute nodes: if we had one SLURM job on one CPU preparing the run (no need to prompt the users here) and then a second one (specified by the user, for the GAMS part and the reporting), we would save resources, no?
By "sequential procedure" I mean all the parts of the preparation scripts that sit outside the "lock" block. Inside this block are mainly the NDC calculations and singleGAMSfile(), which is, I admit, probably the heaviest part. This part cannot be parallelized, meaning the runs wait for each other at this bottleneck. All the rest can take place in parallel. Even though the parts mentioned above are executed sequentially, starting runs is now faster from the user's point of view, because they do not need to wait for the starting script to loop over all runs (including the slow singleGAMSfile part). All runs are sent to the cluster immediately and the user is done with starting the runs.
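The serialization through the lock block can be sketched like this; a minimal illustration using flock with a file-based lock (the lock file path is a placeholder, and the actual locking mechanism in the preparation scripts may differ):

```shell
# Each preparation job runs its non-parallelizable part under an
# exclusive lock, so only one run at a time passes through it.
(
  flock 9   # blocks until the exclusive lock on fd 9 is acquired
  # NDC calculations and singleGAMSfile() would happen here,
  # one run at a time; everything outside this block runs in parallel.
  echo "inside lock"
) 9>/tmp/remind_prepare.lock
```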
> If we had one SLURM job on one CPU preparing the run (no need to prompt the users here) and then a second one (specified by the user, for the GAMS part and the reporting) we would save resources, no?
That's right. This would have a similar positive effect for the user as the one mentioned above. It needs some rework of the starting scripts; I can take care of that after the first month of my parental leave, which I expect to start soon.
David, maybe switching to job dependencies for the submissions is useful for this? I've played with this in my setup that I showed you a while back, remember? And since you've already split the prepare_run function in two, it was pretty straightforward to submit a prepare job and a run job with different sets of resource requirements.
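Such a two-stage submission can be sketched with sbatch's --dependency option; a sketch only — the script names prepare_run.sh and start_run.sh are placeholders, and the resource requests are illustrative:

```shell
# Stage 1: run setup on a single CPU; --parsable makes sbatch print
# just the job ID so we can capture it.
prep_id=$(sbatch --parsable --cpus-per-task=1 prepare_run.sh)

# Stage 2: the GAMS run with the full allocation, started only after
# the preparation job finished successfully (afterok).
sbatch --dependency=afterok:"$prep_id" --cpus-per-task=12 start_run.sh
```

With afterok, a failed preparation job leaves the run job pending instead of starting it on broken input, which is the behavior you'd want here.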