Comments (6)
So right now, LocalProcessSpawner just sends SIGINT followed by SIGTERM. My reading of the code is that the signal handlers for the singleuser notebook server do the same thing for both of those. It's not totally clear to me how shutdown_kernel gets called, because it looks to me like those signal handlers just shut down the ioloop. Regardless, I think the signal used to shut down job processes is going to be an implementation detail of the site job manager -- other than adding command line parameters to qdel/scancel/etc. there's nothing we can do from code to change that.
That said, the notebook app is supposed to have an /api/shutdown web API handler that would actually do what you want in all cases. It doesn't look to me like LocalProcessSpawner uses that, but invoking it would seem to be the actually correct approach in all cases.
Pinging @minrk - is there a particular reason that the Spawner classes don't call the /api/shutdown notebook server API when JupyterHub wants to shut down a user's notebook server?
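As a rough illustration of what calling /api/shutdown from a spawner could look like, here is a minimal sketch that only builds the HTTP request and does not send it. The helper name build_shutdown_request, the example URL, and the token are hypothetical, and the "Authorization: token ..." header is an assumption about how the single-user server would authenticate the call; this is not how any Spawner currently behaves.

```python
import urllib.request


def build_shutdown_request(server_url: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the notebook server's /api/shutdown."""
    url = server_url.rstrip("/") + "/api/shutdown"
    return urllib.request.Request(
        url,
        data=b"",  # empty request body
        method="POST",
        # Assumption: the server accepts token authentication in this form.
        headers={"Authorization": f"token {token}"},
    )


# Hypothetical values, for illustration only.
req = build_shutdown_request("http://127.0.0.1:8888/user/alice/", "abc123")
print(req.get_method(), req.full_url)
```

Sending it would be a matter of passing req to urllib.request.urlopen with a timeout, and it would only work against notebook versions that actually ship the handler.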
from batchspawner.
shutdown_kernel is called during cleanup, which is called after the IOLoop exits.
/api/shutdown is new in notebook 5.1.0, so you can use that only as long as you require the latest notebook release.
This may be a bit of a workaround, but it may be a better solution. What about having slurm send SIGINT first before canceling? The big hack is: batch_cancel_cmd = Unicode('sudo -E -u {username} scancel -s INT {job_id} ; sleep 5 ; sudo -E -u {username} scancel {job_id}')
Slurm seems to send TERM, wait 30 seconds, then send KILL if TERM is trapped. The option --full would make it send to the jupyterhub process as well. Overall we still need some testing, which I can do...
A bigger question is whether these signals should be more integrated into core batchspawner (a separate "interrupt" command which can be called to send SIGINT). That would also be relevant for future progress bars.
What's the best strategy here?
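To make the two-stage cancel hack concrete, here is a small sketch of how that command template expands for one user and job. The render_cancel_cmd helper and the sample values are hypothetical, and plain str.format is used only as a stand-in for whatever substitution batchspawner performs internally.

```python
# Two-stage cancel template from the comment above:
# send SIGINT first, wait five seconds, then issue a plain scancel.
BATCH_CANCEL_CMD = (
    "sudo -E -u {username} scancel -s INT {job_id} ; "
    "sleep 5 ; "
    "sudo -E -u {username} scancel {job_id}"
)


def render_cancel_cmd(username: str, job_id: str) -> str:
    """Fill in the template (stand-in for batchspawner's own substitution)."""
    return BATCH_CANCEL_CMD.format(username=username, job_id=job_id)


# Hypothetical user and job id, for illustration only.
rendered = render_cancel_cmd("alice", "12345")
print(rendered)
```

The five-second sleep is the only grace period the notebook server gets between the interrupt and the regular cancel, so it would need tuning per site.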
Reading the scancel manual page, it seems that we should just give the --full option, which sends the initial SIGTERM to all processes, not only the initial batch shell. But... that is apparently not enough: the jupyterhub-singleuser command also has to be run inside of srun so that it is a "job step". By doing these two things (--full and srun), I get this:
...
[I 2018-04-18 21:49:49.232 SingleUserNotebookApp log:134] 200 GET /dev/user/darstr1/api/config/common?_=1524077386523 (darstr1@::ffff:127.0.0.1) 2.38ms
[I 2018-04-18 21:49:49.605 SingleUserNotebookApp log:134] 200 GET /dev/user/darstr1/nbextensions/nbextensions_configurator/list?_=1524077386524 (darstr1@::ffff:127.0.0.1) 242.59ms
slurmstepd: error: *** JOB 30198090 ON pe1 CANCELLED AT 2018-04-18T21:49:52 ***
slurmstepd: error: *** STEP 30198090.0 ON pe1 CANCELLED AT 2018-04-18T21:49:52 ***
[C 2018-04-18 21:49:52.801 SingleUserNotebookApp notebookapp:1402] received signal 15, stopping
[I 2018-04-18 21:49:52.802 SingleUserNotebookApp notebookapp:1522] Shutting down 0 kernels
<EOF>
(ordering likely caused by buffering of outputs)
Before I found this, I did an initial implementation of a separate "interrupt" command that could be run beforehand to send SIGTERM explicitly. It could still be useful for other queue managers.
I'm not entirely confident in adding srun because I may have had some problems with it before. I will ask our local slurm guru soon and see what the answer should be.
(There are also various possibilities of bash catching the signal and relaying it to the process itself... but that seems overly involved if there is a more direct way)
But before I go too far - do we agree that SIGTERM first is the way to handle graceful shutdowns of servers? Is there any reason to skip this and rely on /api/shutdown in the future? @minrk or whoever else...
I asked our local slurm guru: he thought that the proper way was the --full option on scancel (added to the batchspawner slurm cancel command) and srun added as a wrapper to {cmd} / jupyterhub-singleuser. Then the notebook gets SIGTERM first, then SIGKILL. It seems like it should be applicable to all slurm deployments, but the local KillWait option tells you how much time you have.
The above worked for me on my dev instance - would you like to try?
The other option is trapping the SIGTERM in the shell and killing the child process yourself in the trap function. Or use the new notebook 5.1 /api/shutdown.
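The trap-and-relay idea can be sketched in Python rather than shell: a thin wrapper starts the real command as a child, catches SIGTERM, and passes it on. The function name run_and_forward_sigterm is hypothetical, the sketch assumes a POSIX system, and it is not something batchspawner does today.

```python
import signal
import subprocess


def run_and_forward_sigterm(cmd):
    """Run cmd as a child process, relay SIGTERM to it, return its exit status."""
    child = subprocess.Popen(cmd)

    def relay(signum, frame):
        # Pass the scheduler's SIGTERM on to the child so it can clean up.
        child.send_signal(signal.SIGTERM)

    # Signal handlers can only be installed from the main thread.
    previous = signal.signal(signal.SIGTERM, relay)
    try:
        return child.wait()
    finally:
        # Restore whatever handler was in place before.
        signal.signal(signal.SIGTERM, previous)
```

In a batch script this wrapper would sit where the shell trap function would otherwise go, with the jupyterhub-singleuser command line passed as cmd. Compared with --full plus srun, it adds another process to maintain, which is why the more direct route seems preferable where it works.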