Comments (4)
In the process we had before, the setup of the runs was done sequentially on the login node, so when several runs were started, it could take quite a long time until the last one actually began. By integrating the run setup into the individual SLURM jobs we intended to make starting runs faster for the user and to parallelize those parts of the script that can be parallelized. However, you are right: for the short duration of the individual run setup we now block more CPUs than required. To find out how severe this is, I looked into my latest coupled runs (n = 177). Here are the mean durations per section:
  type  section  mean duration
1 rem   GAMS     67.272571 mins
2 rem   output    7.906585 mins
3 rem   prep      2.562874 mins
It seems that the mean preparation time (2.56 min) is short compared to the mean GAMS runtime (67.27 min). However, across all runs we blocked 12 CPUs 177 times for 2.56 minutes each, which adds up. By splitting the run-start process into two parts again (one SLURM job preparing all the runs, followed by the actual parallel runs), we could save cluster resources, but we would lose time because the runs would be prepared sequentially.
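A quick back-of-envelope calculation makes the resource cost concrete (the numbers are taken from the measurements above; 12 CPUs is the per-run allocation mentioned):

```shell
# Rough CPU time blocked by run preparation across all runs
runs=177      # number of coupled runs measured above
cpus=12       # CPUs allocated per run
prep_min=2.56 # mean preparation time in minutes
awk -v r="$runs" -v c="$cpus" -v p="$prep_min" \
    'BEGIN { printf "%.1f CPU-hours blocked\n", r * c * p / 60 }'
# prints "90.6 CPU-hours blocked"
```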
I don't get what you mean by "sequential procedure". Can you specify? The login nodes on the cluster have around 100 CPUs available. I'm sure they can host our model-preparation jobs, so there is no need to send them to the compute nodes. But if we do have to send them to the compute nodes: if we had one SLURM job on one CPU preparing the run (no need to prompt the users here) and then a second one (specified by the user, for the GAMS part and the reporting), we would save resources, no?
By "sequential procedure" I mean all the parts of the preparation scripts that sit outside the "lock" block. Inside this block are mainly the NDC calculations and singleGAMSfile(), which is, I admit, probably the heaviest part. This part cannot be parallelized, meaning the runs wait for each other at this bottleneck. All the rest can take place in parallel. Even though the parts mentioned above are executed sequentially, starting runs is now faster from the user's point of view, because they do not need to wait for the starting script to loop over all runs (including the slow singleGAMSfile part). All runs are sent to the cluster immediately and the user is done with starting the runs.
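The serialization through the lock block can be sketched like this; a minimal illustration using flock with a file-based lock (the lock file path is a placeholder, and the actual locking mechanism in the preparation scripts may differ):

```shell
# Each preparation job runs its non-parallelizable part under an
# exclusive lock, so only one run at a time passes through it.
(
  flock 9   # blocks until the exclusive lock on fd 9 is acquired
  # NDC calculations and singleGAMSfile() would happen here,
  # one run at a time; everything outside this block runs in parallel.
  echo "inside lock"
) 9>/tmp/remind_prepare.lock
```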
> If we had one SLURM job on one CPU preparing the run (no need to prompt the users here) and then a second one (specified by the user, for the GAMS part and the reporting) we would save resources, no?
That's right. This would have a similar positive effect for the user as the one mentioned above. It needs some rework of the starting scripts; I can take care of that after the first month of my parental leave, which I expect to start soon.
David, maybe switching to job dependencies for the submissions is useful for this? I've played with this in my setup that I showed you a while back, remember? And since you've already split the prepare_run function in two, it was pretty straightforward to submit a prepare job and a run job with different sets of resource requirements.
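Such a two-stage submission can be sketched with sbatch's --dependency option; a sketch only — the script names prepare_run.sh and start_run.sh are placeholders, and the resource requests are illustrative:

```shell
# Stage 1: run setup on a single CPU; --parsable makes sbatch print
# just the job ID so we can capture it.
prep_id=$(sbatch --parsable --cpus-per-task=1 prepare_run.sh)

# Stage 2: the GAMS run with the full allocation, started only after
# the preparation job finished successfully (afterok).
sbatch --dependency=afterok:"$prep_id" --cpus-per-task=12 start_run.sh
```

With afterok, a failed preparation job leaves the run job pending instead of starting it on broken input, which is the behavior you'd want here.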