
Comments (6)

mbmilligan commented on June 2, 2024

So right now, LocalProcessSpawner just sends SIGINT followed by SIGTERM. My reading of the code is that the signal handlers for the singleuser notebook server do the same thing for both of those. It's not totally clear to me how shutdown_kernel gets called, because it looks to me like those signal handlers just shut down the ioloop. Regardless, I think the signal used to shut down job processes is going to be an implementation detail of the site job manager: other than adding command line parameters to qdel/scancel/etc., there's nothing we can do from code to change that.

That said, the notebook app is supposed to have an /api/shutdown web api handler that would actually do what you want in all cases. It doesn't look to me like LocalProcessSpawner uses that, but invoking it would seem to be the correct approach.

Pinging @minrk - is there a particular reason that the Spawner classes don't call the /api/shutdown notebook server api when Jupyterhub wants to shut down a user's notebook server?

from batchspawner.

minrk commented on June 2, 2024

shutdown_kernel is called during cleanup, which is called after the IOLoop exits.

/api/shutdown is new in notebook 5.1.0, so you can use that only as long as you require the latest notebook release.
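
For reference, a minimal sketch of what calling that handler could look like from Python, using only the standard library. The function names, the example URL, and the use of an `Authorization: token ...` header are assumptions made for illustration; this is not what any Spawner class currently does, and it only applies to notebook 5.1.0 or later.

```python
import urllib.request


def build_shutdown_request(server_url, token):
    """Build (but do not send) a POST request to the server's /api/shutdown.

    server_url and token are hypothetical inputs: the base URL of the
    single-user server and an API token that the server accepts.
    """
    url = server_url.rstrip("/") + "/api/shutdown"
    return urllib.request.Request(
        url,
        data=b"",  # empty body; the endpoint is invoked with POST
        method="POST",
        headers={"Authorization": "token " + token},
    )


def shutdown_server(server_url, token, timeout=10):
    """Send the shutdown request and return the HTTP status code."""
    request = build_shutdown_request(server_url, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A spawner that went this route would still want a signal-based fallback for servers that are older than 5.1.0 or that do not answer in time.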


rkdarst commented on June 2, 2024

This is a bit of a workaround, but it may be a better solution: what about having Slurm send SIGINT first before canceling? The big hack is: batch_cancel_cmd = Unicode('sudo -E -u {username} scancel -s INT {job_id} ; sleep 5 ; sudo -E -u {username} scancel {job_id}')
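
To make the hack concrete, here is a small sketch of how that template expands once the placeholders are filled in. The helper name, the `grace` parameter, and the example values are hypothetical, and plain `str.format` stands in for whatever substitution the spawner performs. Because the command chains three steps with `;`, it also assumes the result is executed through a shell.

```python
# Two-stage cancel: interrupt first, wait, then cancel for real.
TWO_STAGE_CANCEL = (
    "sudo -E -u {username} scancel -s INT {job_id} ; "
    "sleep {grace} ; "
    "sudo -E -u {username} scancel {job_id}"
)


def build_cancel_cmd(username, job_id, grace=5):
    """Fill in the template; grace is the pause in seconds between stages."""
    return TWO_STAGE_CANCEL.format(username=username, job_id=job_id, grace=grace)


print(build_cancel_cmd("alice", "12345"))
```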

Slurm seems to send TERM, wait 30 seconds, then KILL, IF TERM is trapped. The option --full would make it send to the jupyterhub process as well. Overall we still need some testing, which I can do...

A bigger question is whether these signals should be more integrated into core batchspawner (a separate "interrupt" command which can be called to send SIGINT). That would also be relevant for future progress bars.

What's the best strategy here?


rkdarst commented on June 2, 2024

Reading the scancel manual page, it seems that we should just give the --full option, which sends the initial SIGTERM to all processes, not only the initial batch shell. But that is apparently not enough: the jupyterhub-singleuser command also has to be run inside of srun so that it is a "job step". By doing these two things (--full and srun), I get this:

...
[I 2018-04-18 21:49:49.232 SingleUserNotebookApp log:134] 200 GET /dev/user/darstr1/api/config/common?_=1524077386523 (darstr1@::ffff:127.0.0.1) 2.38ms
[I 2018-04-18 21:49:49.605 SingleUserNotebookApp log:134] 200 GET /dev/user/darstr1/nbextensions/nbextensions_configurator/list?_=1524077386524 (darstr1@::ffff:127.0.0.1) 242.59ms
slurmstepd: error: *** JOB 30198090 ON pe1 CANCELLED AT 2018-04-18T21:49:52 ***
slurmstepd: error: *** STEP 30198090.0 ON pe1 CANCELLED AT 2018-04-18T21:49:52 ***
[C 2018-04-18 21:49:52.801 SingleUserNotebookApp notebookapp:1402] received signal 15, stopping
[I 2018-04-18 21:49:52.802 SingleUserNotebookApp notebookapp:1522] Shutting down 0 kernels
<EOF>

(ordering likely caused by buffering of outputs)

Before I found this, I did an initial implementation of a separate "interrupt" command that could be run beforehand to send SIGTERM explicitly. That could still be useful for other queue managers.

I'm not entirely confident in adding srun because I may have had some problems with it before. I will ask our local slurm guru soon and see what the answer should be.

(There are also various possibilities of bash catching the signal and relaying it to the process itself... but that seems overly involved if there is a more direct way)


rkdarst commented on June 2, 2024

But before I go too far - do we agree that SIGTERM first is the way to handle graceful shutdowns of servers? Is there any reason to skip this and rely on /api/shutdown in the future? @minrk or whoever else...


rkdarst commented on June 2, 2024

I asked our local slurm guru: he thought that the proper way was the --full option on scancel (added to the batchspawner slurm cancel command) and srun added as a wrapper around {cmd} / jupyterhub-singleuser. Then the notebook gets SIGTERM first, then SIGKILL. It seems like it should be applicable to all slurm deployments, but the site's KillWait setting determines how much time you have between the two signals.
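
Roughly, the two changes could be sketched like this. The templates below are trimmed-down, hypothetical stand-ins rather than the actual batchspawner defaults, and `render` just uses `str.format` for illustration; only `scancel --full` and the `srun` wrapper come from the discussion above.

```python
# Hypothetical minimal batch script: the server is launched through srun so
# that it runs as a job step, not only as a child of the batch shell.
BATCH_SCRIPT = """#!/bin/bash
srun {cmd}
"""

# Cancel command with --full, so the initial SIGTERM is sent to all
# processes and not only to the batch shell.
CANCEL_CMD = "sudo -E -u {username} scancel --full {job_id}"


def render(template, **values):
    """Substitute the {placeholders} in a template string."""
    return template.format(**values)


print(render(BATCH_SCRIPT, cmd="jupyterhub-singleuser"))
print(render(CANCEL_CMD, username="alice", job_id="30198090"))
```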

The above worked for me on my dev instance - would you like to try?

The other option is trapping the SIGTERM in the shell and killing the child process yourself in the trap function. Or the new notebook 5.1 shutdown.
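
The trap idea is normally written in the batch shell itself; as an illustration of the same relay pattern, here is a sketch in Python of a wrapper that starts the server as a child process and forwards SIGTERM and SIGINT to it. The function name is made up, and this is only one possible shape for such a wrapper.

```python
import signal
import subprocess


def run_with_signal_relay(argv, signals=(signal.SIGTERM, signal.SIGINT)):
    """Start argv as a child process and forward the given signals to it.

    Returns the child's exit code (negative if it was killed by a signal).
    Must be called from the main thread, since it installs signal handlers.
    """
    child = subprocess.Popen(argv)

    def forward(signum, frame):
        # Relay the signal only while the child is still running.
        if child.poll() is None:
            child.send_signal(signum)

    # Install the relay handlers, remembering the previous ones.
    previous = {sig: signal.signal(sig, forward) for sig in signals}
    try:
        return child.wait()
    finally:
        # Restore whatever handlers were in place before.
        for sig, handler in previous.items():
            signal.signal(sig, handler)
```

A batch script would then launch something like `run_with_signal_relay(["jupyterhub-singleuser"])` in place of the server itself, so that a SIGTERM delivered to the wrapper reaches the notebook process before SIGKILL arrives.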

