@yuvipanda
Thanks for all your work on kubespawner. I've started experimenting with running JupyterHub on Kubernetes, largely thanks to this spawner, but I wanted to get some guidance around my use-cases / workflow from someone a bit more seasoned in this technology. I'm structuring these as a series of high-level questions, where your input would be much appreciated. For ease of explanation, I may refer to the rough sketch further down.
My efforts so far, for context:
I was working through the data-8/jupyterhub-k8s implementation, which I think bases itself off your work, since its structure as a Helm chart is the easiest to work with, compared to some of the other implementations I've found out there.
I modified that set-up slightly to handle GitLab authentication (rather than Google), which worked OK, but I wasn't able to get the spawning of their large user image (>5GB), based on this Dockerfile and their hub image, to work. It was constantly stuck in a Waiting: ContainerCreating state and would then try to re-spawn itself. I haven't figured out what the problem is, but there appears to be plenty of space on the cluster. I'm using v1.5.1 of Kubernetes on GCE.
Anyway, I ended up getting things working by instead using the hub image below (a variation of the data-8 one, Dockerfile follows) in conjunction with your yuvipanda/simple-singleuser:v1 user image.
FROM jupyterhub/jupyterhub-onbuild:0.7.1
# Install kubespawner and its dependencies
RUN /opt/conda/bin/pip install \
oauthenticator==0.5.* \
git+https://github.com/derrickmar/kubespawner \
git+https://github.com/yuvipanda/jupyterhub-nginx-chp.git
ADD jupyterhub_config.py /srv/jupyterhub_config.py
ADD userlist /srv/userlist
WORKDIR /srv/jupyterhub
EXPOSE 8081
CMD jupyterhub --config /srv/jupyterhub_config.py --no-ssl
This was able to spawn new user persistent volumes, bind them to PVCs and obviously spawn user Jupyter notebook servers, which could be stopped/started and re-use the same PV. My initial tests as to whether new files/notebooks were getting persisted on the PV were failing, since I wasn't saving them under /home, which is where the volume is mounted.
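For reference, the behaviour described above (one PVC per user, mounted at /home) corresponds to configuration along these lines. This is a sketch of my understanding rather than the exact data-8 config; the claim/volume names are illustrative, though kubespawner does expand the {username} placeholder:

```python
# jupyterhub_config.py fragment (illustrative): bind one PVC per user
# and mount it at /home, so only files saved under /home survive a
# server restart. {username} is expanded by kubespawner.
c.KubeSpawner.volumes = [{
    'name': 'volume-{username}',
    'persistentVolumeClaim': {'claimName': 'claim-{username}'},
}]
c.KubeSpawner.volume_mounts = [{
    'name': 'volume-{username}',
    'mountPath': '/home',
}]
```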
i. user management / userid - During various aborted attempts to get the larger data-8 user image working, user PVs weren't being deleted, and I noticed that the userid appended to the username for naming the PV kept incrementing. It wasn't clear where this numbering logic was coming from, as it wasn't an env variable in any of the manifests. Is this some fail-safe of some sort?
Currently, I'm using a whitelist userlist for users (see code from jupyterhub_config.py below), and these entries correspond to my users' GitLab logins that I'm authenticating against. However, it's probably not a clean solution. I see you are working on another approach with fsGroup and just wanted to get a better understanding of the context of that solution?
# Whitelist users and admins
import os

c.Authenticator.whitelist = whitelist = set()
c.Authenticator.admin_users = admin = set()
c.JupyterHub.admin_access = True

pwd = os.path.dirname(__file__)
with open(os.path.join(pwd, 'userlist')) as f:
    for line in f:
        # Skip blank/whitespace-only lines; a bare `if not line` would
        # let them through and parts[0] would raise an IndexError
        if not line.strip():
            continue
        parts = line.split()
        name = parts[0]
        whitelist.add(name)
        if len(parts) > 1 and parts[1] == 'admin':
            admin.add(name)
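Given the parsing above, the userlist file is just one username per line, with an optional second token marking admins. Something like (usernames made up):

```
yuvipanda admin
someuser
anotheruser
```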
ii. possibility for interchangeable images - I find the current default set-up with JupyterHub, allowing only a single image to be spawned, very limiting. I can see from #14 that you are considering extending functionality in kubespawner to allow for an image to be selected. @minrk was able to confirm over here that it could be possible to pass this image selection programmatically via the JupyterHub API, although, as per this issue, I'm not sure whether the hub API will work in a Kubernetes context.
You pointed to an implementation by Google here. It's not clear to me where they are deriving their list of available images. How do you think something like this should work?
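Thinking out loud, one way this could work is via JupyterHub's spawner options form: render a <select> from a whitelisted label-to-image mapping, then validate the submitted value before handing it to the spawner. The helpers and image names below are entirely my own invention (not a kubespawner API), just to sketch the idea:

```python
# Hypothetical sketch: build an options form from a whitelist of images,
# and validate what comes back from the form submission.

ALLOWED_IMAGES = {
    # assumed label -> image-spec mapping; could instead be derived
    # from a registry query or a per-user repo list
    'Minimal': 'yuvipanda/simple-singleuser:v1',
    'Data 8': 'data8/systemuser:latest',
}

def build_options_form(images):
    """Render an HTML <select> listing the allowed images."""
    options = ''.join(
        '<option value="{}">{}</option>'.format(spec, label)
        for label, spec in sorted(images.items())
    )
    return '<select name="image">{}</select>'.format(options)

def image_from_form(formdata, images):
    """Return the chosen image spec, rejecting anything not whitelisted."""
    spec = formdata['image'][0]
    if spec not in images.values():
        raise ValueError('image not allowed: {}'.format(spec))
    return spec
```

A spawner subclass could then set its image from `image_from_form(...)` in its options handling, so users only ever pick from the curated list.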
As per the sketch up top, I'm looking to handle a set-up where users have various private/shared repos (marked 1 in the sketch), from which Docker images are generated and stored in a registry (2). My users (3) would then be able to spawn a compute environment for their chosen repo in Kubernetes (4), with the possibility, from 5, of having the repo cloned (maybe leveraging gitRepo), and of persisting any incremental work performed on it while on the notebook server (6).
iii. multiple simultaneous servers per user based on different images - As far as I understand, it's not presently possible with JupyterHub for a user to have multiple instances of a notebook server, each running a different image? Do the tools exist within Kubernetes to potentially facilitate this? Thinking out loud, could this be facilitated by having multiple smaller persistent volumes per user, based on the repo from which the server image is derived? Or maybe this could be achieved within a single PV, using the subPath functionality?
c.KubeSpawner.volumes = [
    {
        'name': 'volume-{username}-{repo-namespace}-{repo-name}',
        'persistentVolumeClaim': {
            'claimName': 'claim-{username}-{repo-namespace}-{repo-name}'
        }
    }
]
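One wrinkle with per-repo claim names: Kubernetes object names must be valid DNS-1123 labels (lowercase alphanumerics and hyphens, at most 63 characters), so user/repo combinations would need sanitising. An entirely hypothetical helper for that might look like:

```python
import re

def pvc_name(username, repo_namespace, repo_name):
    """Build a DNS-1123-safe claim name per user+repo (hypothetical scheme).

    Lowercases the parts, replaces any disallowed character with '-',
    and truncates to the 63-character Kubernetes name limit.
    """
    raw = 'claim-{}-{}-{}'.format(username, repo_namespace, repo_name)
    safe = re.sub(r'[^a-z0-9-]', '-', raw.lower()).strip('-')
    return safe[:63]
```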
iv. ideas around version-control - Given the various advantages derived from using Kubernetes to host Jupyter, I would be curious whether you have thoughts on whether Kubernetes also potentially makes it easier to manage version control for notebooks and other files created while a user works in a notebook server environment. Perhaps something like preStop hooks could be used to commit and push changes prior to a container shutting down.
Even letting a user run git commands from a notebook server terminal, with SSH keys for the version-control system handled via Kubernetes secrets/config maps, might be a start. Have you seen any implementations solving this?
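As a sketch of the preStop idea: kubespawner doesn't expose lifecycle hooks as far as I know, so this is the raw pod-spec fragment (expressed as a Python dict), not a supported KubeSpawner option; the working directory and commit message are assumptions:

```python
# Hypothetical pod-spec lifecycle fragment: before the container stops,
# commit and push whatever is in the user's work directory. The trailing
# `|| true` keeps shutdown from failing when there is nothing to commit.
lifecycle = {
    'preStop': {
        'exec': {
            'command': [
                '/bin/sh', '-c',
                'cd /home/jovyan/work && '
                'git add -A && '
                'git commit -m "autosave on shutdown" && '
                'git push origin HEAD || true',
            ]
        }
    }
}
```

Note that preStop hooks are best-effort and bounded by the pod's termination grace period, so a slow push could still be cut off.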
Thanks for your patience in reading through this!